Abstract

In this tutorial paper, a comprehensive survey is given on several major systematic approaches in 
dealing with delay-aware control problems, namely the equivalent rate constraint approach, the Lyapunov 
stability drift approach and the approximate Markov Decision Process (MDP) approach using stochastic 
learning. These approaches essentially embrace most of the existing literature regarding delay-aware 
resource control in wireless systems. They have their relative pros and cons in terms of performance, 
complexity and implementation issues. For each of the approaches, the problem setup, the general 
solution and the design methodology are discussed. Applications of these approaches to delay-aware 
resource allocation are illustrated with examples in single-hop wireless networks. Furthermore, recent 
results regarding delay-aware multi-hop routing designs in general multi-hop networks are elaborated. 
Finally, the delay performance of the various approaches is compared through simulations using an
example of uplink OFDMA systems.

Fig. 1. Illustration of cross-layer resource allocation with respect to both the MAC layer state (QSI) and PHY layer state (CSI).

I. Introduction 

There is plenty of literature on cross-layer resource optimization in wireless systems. For
example, there are papers on joint power and subcarrier allocations to maximize the sum
throughput for OFDMA systems [1], [2]. There are also papers on joint power and precoder
optimization to boost the sum rate, weighted sum MMSE or SINR for MIMO wireless systems
[3], [4]. All these papers illustrate that significant throughput gain can be obtained by joint
optimization of radio resource across the Physical (PHY) and the Media Access Control (MAC)
layers. However, a typical assumption in these papers is that the transmitter has an infinite backlog
and the information flow is delay insensitive. As a result, these papers focus only on optimizing
the PHY layer performance metrics such as sum throughput, MMSE, SINR or proportional
fairness, and the resulting control policy is adaptive to the channel state information (CSI) only.

In practice, it is very important to consider random bursty arrivals and delay performance 
metrics in addition to the conventional PHY layer performance metrics in cross-layer optimiza- 
tion, which may embrace the PHY, MAC and network layers. A combined framework taking 
into account both queueing delay and PHY layer performance is not trivial as it involves both 
queueing theory (to model the queue dynamics) and information theory (to model the PHY layer 
dynamics). The system state involves both the CSI and the queue state information (QSI) and the 
delay-optimal control policy should be adaptive to both the CSI and the QSI of wireless systems 
as illustrated in Fig. 1. This design approach is fundamentally challenging for the following
reasons. First, there may not be closed-form expressions relating the optimization objective (such
as the average delay) and the optimization variables (power, precoder, etc.). Second, it is not clear
if the optimization problems are convex (in most cases, they are not convex). Third, there is
the curse of dimensionality due to the exponential growth of the cardinality of the system state
space as well as the large dimension of the control action space (i.e., the set of actions).
For example, consider a queueing network with $N$ queues, each with finite buffer size $N_Q$. The
size of the system state space is $O(N_Q^N)$, which is unmanageable even for a small number of users
and a small buffer length $N_Q$.
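
To make the exponential growth concrete, the following Python sketch counts the joint queue states of a small network. It assumes that each queue length takes integer values from 0 to the buffer size, which is our modeling assumption for illustration; the function name `queue_state_space_size` is hypothetical, and the CSI states (which enlarge the state space further) are ignored.

```python
def queue_state_space_size(num_queues: int, buffer_size: int) -> int:
    """Count joint queue states when each queue holds 0..buffer_size units.

    Assumption: queue lengths are integers, so each queue has
    (buffer_size + 1) possible values and the joint count is their product.
    """
    if num_queues < 0 or buffer_size < 0:
        raise ValueError("num_queues and buffer_size must be non-negative")
    return (buffer_size + 1) ** num_queues


if __name__ == "__main__":
    # The count grows exponentially in the number of queues.
    for n in (2, 5, 10):
        print(n, queue_state_space_size(n, buffer_size=10))
```

Even with a buffer of only 10 units, ten queues already give more than $10^{10}$ joint states, which is why brute-force enumeration of the state space is not viable.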

There are various approaches to deal with delay-aware resource control in wireless networks
[5], [6]. One approach converts average delay constraints into equivalent average rate constraints
using the large deviation theory and solves the optimization problem using a purely information
theoretical formulation based on the rate constraints [7]-[12]. While this approach allows
potentially simple solutions, the resulting control policies are only functions of the CSI and such
controls are good only for the large delay regime where the probability of empty buffers is small.
In general, optimal control policies should be functions of both the CSI and QSI. In addition, due 
to the complex coupling among queues in multi-hop wireless networks, it is difficult to express 
the average delay in terms of all the control actions. Therefore, it is not easy to generalize this 
approach to joint resource allocation and routing in multi-hop wireless networks. 

A second approach to deal with delay-aware resource control utilizes the notion of Lyapunov
stability and establishes throughput-optimal control policies (in the stability sense). The
throughput-optimal policies ensure the stability of the queueing network if stability can indeed
be achieved under any policy. Three classes of policies that are known to be throughput-optimal
include the Max Weight rule [13], the Exponential (EXP) rule [14] and the Log rule [15]. Among
the three classes, the throughput-optimal property of the Max Weight type algorithms [16] and the
Log rule [15] are both proved by the theory of Lyapunov drift, whereas the EXP rule is proved
to be throughput-optimal by the fluid limit technique along with a separation of time scales
argument in [14]. Specifically, the general Max Weight type algorithms are proved to minimize
the Lyapunov drift, and hence, are throughput-optimal. Many dynamic control algorithms belong
to this type, which include optimizing the allocation of computer resources [17], and stabilizing
packet switch systems [18]-[21] and satellite and wireless systems [22]-[24]. The Lyapunov drift
theory (which only focuses on controlling a queueing network to achieve stability) is extended
to the Lyapunov optimization theory (which enables stability and performance optimization to
be treated simultaneously) [13], [25]-[27]. For example, utilizing the Lyapunov optimization
theory, the Energy-Efficient Control Algorithm (EECA) proposed in [27] stabilizes the system
and consumes an average power that is arbitrarily close to the minimum power solution with
a corresponding tradeoff in network delay. In transport layer flow control and network fairness
optimization, the Cross Layer Control (CLC) algorithm was designed in [25] to achieve a fair
throughput point which is arbitrarily close to optimal with a corresponding tradeoff in network
delay, when the exogenous arrival rates are outside of the network stability region. In [28]
and [29], the authors consider the asymptotic single-user and multi-user power-delay tradeoff
in the large delay regime and obtain insights into the structure of the optimal control policy
in that regime. Although the policy derived by the Lyapunov drift theory and the Lyapunov
optimization theory (e.g., the dynamic backpressure algorithm) may not have good delay
performance in moderate and light traffic loading regimes, it allows potentially simple solutions
with throughput optimality in multi-hop wireless networks. However, throughput optimality is a
weak form of delay performance and it is also of great interest to study scheduling policies that
minimize average delay of queueing networks.
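
As a minimal sketch of the Max Weight idea, the Python function below handles the simplest single-hop setting in which exactly one link is served per slot: it picks the link whose product of queue length and achievable rate is largest. This single-link restriction, the tie-breaking rule and the name `max_weight_link` are our assumptions for illustration and are not taken from [13].

```python
from typing import Sequence


def max_weight_link(queue_lengths: Sequence[float], rates: Sequence[float]) -> int:
    """Return the index of the link with the largest queue-length x rate weight.

    Assumption: one link is scheduled per slot; ties go to the lowest index.
    """
    if len(queue_lengths) != len(rates) or len(queue_lengths) == 0:
        raise ValueError("queue_lengths and rates must be non-empty and equal in length")
    weights = [q * r for q, r in zip(queue_lengths, rates)]
    # max() returns the first maximal index, which implements the tie-break.
    return max(range(len(weights)), key=weights.__getitem__)


if __name__ == "__main__":
    # Weights are 10.0, 6.0 and 12.5, so link 2 is scheduled.
    print(max_weight_link([10, 2, 5], [1.0, 3.0, 2.5]))
```

The decision depends on both the QSI (queue lengths) and the CSI (through the achievable rates), in contrast to a CSI-only policy that would always serve the link with the best channel.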

A more systematic approach in dealing with delay-optimal resource control in the general delay
regime is the Markov Decision Process (MDP) approach. In some special cases, it may be
possible to obtain simple delay-optimal solutions. For example, in [30], [31], the authors utilize
Stochastic Majorization to show that the longest queue highest possible rate (LQHPR) policy is
delay-optimal for multiaccess systems with homogeneous users. However, in general, the delay-optimal
control belongs to the infinite horizon average cost MDP, and it is well known that there
is no simple solution associated with such an MDP. Brute force value iterations or policy iterations
[32], [33] could not lead to any viable solutions due to the curse of dimensionality. In addition
to the above challenges, the problem is further complicated under distributed implementation
requirements. For instance, the delay-optimal control actions should be adaptive to both the
global system CSI and QSI. However, these CSI and QSI observations are usually measured
locally at some nodes of the network and hence, centralized solutions require huge signaling
overhead to deliver all these local CSI and QSI to the centralized controller. It is very desirable
to have distributed solutions where the control actions are computed locally based on the local
CSI and QSI measurements.

A systematic understanding of delay-aware control in wireless communications is the key to
truly embracing both the PHY layer and the MAC layer in cross-layer designs. In this paper,
we give a comprehensive survey on the major systematic approaches in dealing with delay-aware
control problems, namely the equivalent rate constraint approach, the Lyapunov stability
drift approach and the approximate MDP approach using stochastic learning. These approaches
essentially embrace most of the existing literature regarding delay-aware resource control in
wireless systems. They have their relative pros and cons in terms of performance, complexity
and implementation issues. For each of the approaches, we discuss the problem setup, the general
solution, the design methodology and the limitations of delay-aware resource allocations with
simple examples in single-hop wireless networks. We also discuss recent advances in delay-aware
routing designs in multi-hop wireless networks.

The paper is organized as follows. In Section II, we elaborate on the basic concepts of cross-layer
resource allocation, which consist of the system model, the source model, the control
policies, the queue dynamics and the general resource control problem formulation. In Section III,
we elaborate on the theory and the framework of the first approach (equivalent rate constraint).
In Section IV, we elaborate on the theory and the framework of the second approach (Lyapunov
stability drift). In Section V, we elaborate on the theory and the framework of the third approach
(MDP) and illustrate how the approximate MDP and stochastic learning could help to obtain
low complexity and distributed delay-aware control solutions. In Section VI, we discuss the
delay-aware routing designs in multi-hop wireless networks. In Section VII, we compare the
performance of the aforementioned approaches in a common application topology, namely the
uplink OFDMA systems with multiple users. Finally, we conclude with a brief summary of the
results in Section VIII.

II. System Model and General Cross-Layer Optimization Framework 

In this section, we elaborate on the system model, the queue model and the framework of resource
control for general wireless networks. We also use uplink OFDMA systems as an example
to make the description easy to understand.

Fig. 2. Illustrative diagram of a multi-hop wireless network with $\mathcal{N} = \{1, 2, \ldots, 9\}$, $\mathcal{L} = \{1, 2, \ldots, L\}$ and $\mathcal{C} = \{1, 2, 3\}$.

A. System Model 

In this paper, we study delay-aware resource control in a general multi-hop wireless network
with a set of $N$ nodes $\mathcal{N} = \{1, 2, \ldots, N\}$ and a set of $L$ transmission links $\mathcal{L} = \{1, 2, \ldots, L\}$ as
illustrated in Fig. 2. Each link in set $\mathcal{L}$ denotes a communication channel for direct transmission
from node $s \in \mathcal{N}$ to node $d \in \mathcal{N}$, and is labeled by the ordered pair^1 $(s, d)$. We denote $s(l)$ and
$d(l)$ as the transmit node and the receive node of the $l$-th link, respectively.

The network is assumed to work in slotted time with slot boundaries that occur at time
instances $t \in \{1, 2, \ldots\}$. We use slot $t$ to denote the time interval $[t, t+1)$. Denote $\mathbf{H}(t) =
[H_1(t), H_2(t), \ldots, H_L(t)] \in \mathcal{H}$ as the CSI of all $L$ links in set $\mathcal{L}$ in slot $t$, where $\mathcal{H}$ denotes the
system CSI state space. We have the following assumption on the channel fading.

Assumption 1 (Assumption on the Channel Fading): Each element in $\mathbf{H}(t)$ takes values from the
discrete state space $\mathcal{H}$ and the system CSI $\mathbf{H}(t)$ is a Markov process, i.e.,

$$\Pr[\mathbf{H}(t) \mid \mathbf{H}(t-1), \mathbf{H}(t-2), \ldots, \mathbf{H}(0)] = \Pr[\mathbf{H}(t) \mid \mathbf{H}(t-1)].$$

■

The general network model described above encompasses a wide range of practical network
topologies.

^1 Note that $(s, d)$ and $(d, s)$ denote two different transmission links: the former is the link from the $s$-th node to the $d$-th node,
whereas the latter is the link from the $d$-th node to the $s$-th node.

B. Source Model 

All data that enters the network is associated with a particular commodity^2 $c \in \mathcal{C}$, which
minimally defines the destination of the data, but might also specify other information, such as
the source node of the data or its priority service class [13]. $\mathcal{C} = \{1, 2, \ldots, C\}$ represents the set
of $C$ commodities in the network. Let $A_n^{(c)}(t)$ denote the amount of new commodity $c$ data (in
number of bits) that exogenously arrives to node $n$ at the end of slot $t$. We make the following
assumption on the arrival process.

Assumption 2 (Assumption on Arrival Process): The packet arrival process $A_n^{(c)}(t) \in [0, A_{n,\max}^{(c)}]$
is i.i.d. over scheduling slots following a general distribution with average arrival rate $\mathbb{E}[A_n^{(c)}(t)] = \lambda_n^{(c)}$. ■

Each node $n$ maintains a set of queues for storing data according to its commodity. Let $Q_n^{(c)}(t)$
denote the queue length (in number of bits) of commodity $c$ stored at node $n$. Note that we let
$Q_n^{(c)}(t) = 0$ for all $t$ if node $n$ is the destination of commodity $c$. Let $\mu_l^{(c)}(t)$ denote the rate
offered to commodity $c$ over link $l$ during slot $t$. Therefore, the system queue dynamics is given
by [13]

$$Q_n^{(c)}(t+1) \le \max\Big\{ Q_n^{(c)}(t) - \sum_{l \in \{l: s(l)=n\}} \mu_l^{(c)}(t),\ 0 \Big\} + A_n^{(c)}(t) + \sum_{l \in \{l: d(l)=n\}} \mu_l^{(c)}(t). \qquad (1)$$

The above expression is an inequality rather than an equality because the actual amount of
commodity $c$ data arriving to node $n$ during slot $t$ may be less than $\sum_{l \in \{l: d(l)=n\}} \mu_l^{(c)}(t)$ if the
neighboring nodes have little or no commodity $c$ data to transmit. For notational convenience,
we define the QSI as $\mathbf{Q}(t) = [Q_n^{(c)}(t)] \in \mathcal{Q}$, where $\mathcal{Q}$ denotes the system QSI state space.
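
The queue evolution described above can be sketched in a few lines of Python. The function computes the upper bound on the next queue length of one commodity at one node: the backlog left after the offered outgoing service, plus the exogenous arrivals, plus the service offered on incoming links. The function name `queue_update` and its argument names are hypothetical, and the endogenous term is an upper bound because upstream nodes may have less data than the offered rate.

```python
def queue_update(queue: float, offered_out: float, exogenous: float,
                 offered_in: float = 0.0) -> float:
    """Upper bound on the next-slot queue length of one commodity at one node.

    queue       : current queue length (bits)
    offered_out : total rate offered on outgoing links during the slot (bits)
    exogenous   : new data arriving from outside the network (bits)
    offered_in  : total rate offered on incoming links (bits); the data that
                  actually arrives may be smaller, hence "upper bound"
    """
    if min(queue, offered_out, exogenous, offered_in) < 0:
        raise ValueError("all quantities must be non-negative")
    return max(queue - offered_out, 0.0) + exogenous + offered_in


if __name__ == "__main__":
    q = 0.0
    # Single-hop example: no endogenous arrivals, so the bound is exact.
    for arrival, service in [(3.0, 1.0), (0.0, 5.0), (2.0, 1.0)]:
        q = queue_update(q, offered_out=service, exogenous=arrival)
        print(q)
```

In a single-hop network the incoming-link term vanishes and the update holds with equality, which is the case used in the uplink OFDMA example later in this section.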

C. Control Policy and Resource Control Framework 

Let $\chi(t) = \{\mathbf{H}(t), \mathbf{Q}(t)\} \in \mathcal{X}$ be the system state which can be estimated by the resource
controller at the $t$-th slot, where $\mathcal{X} = \mathcal{H} \times \mathcal{Q}$ is the full system state space. In practice, different
control policies may be adaptive to partial or full system states. For example, a CSI-only control
policy has control actions that are adaptive to the partial system state CSI only. A QSI-only
control policy has control actions that are adaptive to the partial system state QSI only. A cross-layer
control policy has control actions that are adaptive to the full system state, i.e., the CSI and
the QSI. We define $\Omega: \mathcal{X} \to \mathcal{A}$ to be the control policy, which is a mapping from the full system
state space $\mathcal{X}$ to the action space $\mathcal{A}$. The control policy may include the resource allocation policy
(e.g., power allocation policy, subcarrier allocation policy, precoder design policy, etc.) and the
routing policy.

^2 The commodity index $c$ can be interpreted as the data flow index in the network.

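Under control policy $\Omega$, the average queue length of commodity $c$ stored at node $n$ is given
by

$$\bar{Q}_n^{(c)} = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\big[Q_n^{(c)}(t)\big], \quad \forall n \in \mathcal{N},\ c \in \mathcal{C},$$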

where $\mathbb{E}^{\Omega}[\cdot]$ means the expectation taken w.r.t. the measure induced by the given policy
$\Omega$. We also introduce the average drop rate as a performance metric in our general system model
to incorporate delay-aware resource control in queueing networks with finite buffer size (cf.
[32]), where data dropping is necessary when a buffer overflows. For queueing networks with
finite buffer size $N_Q$, the average drop rate of commodity $c$ stored at node $n$ is defined as

$$\bar{d}_n^{(c)} = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\big[\mathbf{1}[Q_n^{(c)}(t) = N_Q]\big], \quad \forall n \in \mathcal{N},\ c \in \mathcal{C}.$$
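
The two time averages defined above can be estimated from a sample path. The Python sketch below simulates one finite-buffer queue and returns the empirical average queue length and the fraction of slots in which the buffer is full. The Bernoulli arrival and service model, the order of service and arrival within a slot, and the name `simulate_finite_buffer` are assumptions made purely for illustration.

```python
import random
from typing import Tuple


def simulate_finite_buffer(num_slots: int, buffer_size: int,
                           arrival_prob: float, service_prob: float,
                           seed: int = 0) -> Tuple[float, float]:
    """Estimate (average queue length, fraction of slots with a full buffer).

    Assumptions: at most one unit arrives and at most one unit is served per
    slot (Bernoulli model); the queue is observed at the start of each slot;
    an arrival that finds the buffer full is dropped.
    """
    if num_slots <= 0 or buffer_size < 0:
        raise ValueError("num_slots must be positive and buffer_size non-negative")
    rng = random.Random(seed)
    queue = 0
    queue_sum = 0
    full_slots = 0
    for _ in range(num_slots):
        queue_sum += queue
        if queue == buffer_size:
            full_slots += 1
        # Service first, then arrival, within the same slot.
        if queue > 0 and rng.random() < service_prob:
            queue -= 1
        if rng.random() < arrival_prob and queue < buffer_size:
            queue += 1
    return queue_sum / num_slots, full_slots / num_slots


if __name__ == "__main__":
    avg_q, full_fraction = simulate_finite_buffer(100_000, 10, 0.45, 0.5)
    print(avg_q, full_fraction)
```

Longer runs give estimates closer to the limiting averages; the fixed seed only makes the sketch reproducible.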



Taking the effect of data dropping into consideration, we refer to the average delay as the average
time that a piece of data stays in the network before reaching the destination (averaged over
the data that are not dropped^3). This is because the penalty of data dropping is accounted for
separately in the average drop rate. The following lemma extends Little's Law to the case with
data dropping.

Lemma 1 (Little's Law with Data Dropping): The average delay of all the commodities and
commodity $c$ in the network are given by

$$\bar{D} \le \frac{\sum_{n \in \mathcal{N}} \sum_{c \in \mathcal{C}} \bar{Q}_n^{(c)}}{\sum_{n \in \mathcal{N}} \sum_{c \in \mathcal{C}} \big(1 - \bar{d}_n^{(c)}\big) \lambda_n^{(c)}}, \qquad (2)$$

$$\bar{D}^{(c)} \le \frac{\sum_{n \in \mathcal{N}} \bar{Q}_n^{(c)}}{\sum_{n \in \mathcal{N}} \big(1 - \bar{d}_n^{(c)}\big) \lambda_n^{(c)}}, \qquad (3)$$

where the above two inequalities are asymptotically tight for general multi-hop networks as
$\bar{d}_n^{(c)} \to 0$ for all $n \in \mathcal{N}$ and $c \in \mathcal{C}$. In addition, in single-hop queueing networks, the inequalities
in (2) and (3) are tight for any $\bar{d}_n^{(c)}$ and $A_{n,\max}^{(c)} = 1$, and the average delay of commodity $c$ at
node $n$ is given by

$$\bar{D}_n^{(c)} = \frac{\bar{Q}_n^{(c)}}{\big(1 - \bar{d}_n^{(c)}\big) \lambda_n^{(c)}}.$$

Proof: The proof can be easily extended from the standard Little's law [34] by considering
the data that are not dropped^4. We omit the details due to page limit. ■

^3 For example, suppose 100 packets enter a single-hop network, among which 10 packets are dropped and the other 90 packets
are successfully delivered to their destinations. Furthermore, the total time taken by the 90 packets to reach their destinations
is 90. The average delay is given by 90/90 = 1 and the average drop rate is given by 10/100 = 0.1.
Remark 1 (Interpretation of Lemma 1): Lemma 1 establishes the relationship among the average
delay, the average queue length and the average drop rate in general networks. Given the
average drop rate, the average delay bound for general multi-hop networks or the average delay
for single-hop networks is proportional to the average queue length. Thus, the average queue
length (and the average drop rate if data dropping happens) is commonly used in the existing
literature [13] as the delay performance measure. ■
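
The proportionality between delay and queue length noted in Remark 1 can be illustrated with a short Python sketch. It assumes the usual single-hop form of Little's law with dropping, in which the average delay equals the average queue length divided by the rate of data that is admitted rather than dropped; this exact form and the name `average_delay` are our assumptions for illustration.

```python
def average_delay(avg_queue_length: float, arrival_rate: float,
                  drop_rate: float) -> float:
    """Average delay of non-dropped data in a single-hop queue.

    Assumed relation: delay = avg_queue_length / ((1 - drop_rate) * arrival_rate),
    i.e. Little's law applied to the admitted traffic only.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must lie in [0, 1)")
    if arrival_rate <= 0.0 or avg_queue_length < 0.0:
        raise ValueError("arrival_rate must be positive and queue length non-negative")
    admitted_rate = (1.0 - drop_rate) * arrival_rate
    return avg_queue_length / admitted_rate


if __name__ == "__main__":
    # Doubling the average queue length doubles the delay for a fixed drop rate.
    print(average_delay(9.0, 10.0, 0.1), average_delay(18.0, 10.0, 0.1))
```

For a fixed arrival rate and drop rate, the delay scales linearly with the average queue length, which is why the queue length is a convenient proxy for delay in the problem formulations that follow.
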
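Moreover, under control policy $\Omega$, the average throughput of link $l \in \mathcal{L}$ and the average power
consumption of node $n \in \mathcal{N}$ are given by

$$\bar{T}_l = \limsup_{T \to +\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\Big[\sum_{c \in \mathcal{C}} \mu_l^{(c)}(t)\Big], \quad \forall l \in \mathcal{L},$$

$$\bar{P}_n = \limsup_{T \to +\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\Big[\sum_{l \in \{l: s(l)=n\}} \sum_{c \in \mathcal{C}} p_l^{(c)}(t)\Big], \quad \forall n \in \mathcal{N},$$

respectively, where $p_l^{(c)}(t)$ denotes the power allocated to commodity $c$ over link $l$ at slot $t$.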

^4 Since all the data (including the data that are ultimately dropped) contribute to the queue length process (before being
dropped), we have inequalities in (2) and (3).

Therefore, delay-aware resource control problems for wireless networks can be divided into
the following three categories:

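• Category I: Maximize the average weighted sum system throughput (or average arrival
rate) subject to average delay constraints, average power constraints and average drop rate
constraints. Thus, the delay-aware resource control problem can be expressed as

$$\max_{\Omega} \ \sum_{l \in \mathcal{L}} w_l \bar{T}_l \qquad (4)$$
$$\text{s.t.} \quad \bar{Q}_n^{(c)} \le Q_n^{(c)}, \quad \forall n \in \mathcal{N},\ c \in \mathcal{C}$$
$$\bar{P}_n \le P_n, \quad \forall n \in \mathcal{N}$$
$$\bar{d}_n^{(c)} \le d_n^{(c)}, \quad \forall n \in \mathcal{N},\ c \in \mathcal{C},$$

where $w_l$ is the weight for the $l$-th link, and $Q_n^{(c)}$, $P_n$ and $d_n^{(c)}$ are the average delay constraint,
the average power constraint and the average drop rate constraint for commodity $c$ at node
$n$, respectively.

• Category II: Minimize the average weighted sum delay subject to average power constraints
and average drop rate constraints for given arrival rates at all sources. Thus, the delay-aware
resource control problem can be expressed as

$$\min_{\Omega} \ \sum_{n \in \mathcal{N}} \sum_{c \in \mathcal{C}} w_n^{(c)} \bar{Q}_n^{(c)} \qquad (5)$$
$$\text{s.t.} \quad \bar{P}_n \le P_n, \quad \forall n \in \mathcal{N}$$
$$\bar{d}_n^{(c)} \le d_n^{(c)}, \quad \forall n \in \mathcal{N},\ c \in \mathcal{C},$$

where $w_n^{(c)}$ is the weight for commodity $c$ at node $n$.

• Category III: Minimize the average weighted sum power consumption subject to average
delay constraints and average drop rate constraints for given arrival rates at all sources.
Thus, the delay-aware resource control problem can be expressed as

$$\min_{\Omega} \ \sum_{n \in \mathcal{N}} w_n \bar{P}_n \qquad (6)$$
$$\text{s.t.} \quad \bar{Q}_n^{(c)} \le Q_n^{(c)}, \quad \forall n \in \mathcal{N},\ c \in \mathcal{C}$$
$$\bar{d}_n^{(c)} \le d_n^{(c)}, \quad \forall n \in \mathcal{N},\ c \in \mathcal{C},$$

where $w_n$ is the weight for node $n$.

Fig. 3. Block diagram of uplink OFDMA systems.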



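Remark 2 (Unified Optimization Framework): Note that the Lagrangian function of all the
above optimization problems can be written in a unified form:

$$L = \sum_{l \in \mathcal{L}} \beta_l \bar{T}_l + \sum_{n \in \mathcal{N}} \sum_{c \in \mathcal{C}} \Big( \nu_n^{(c)} \bar{Q}_n^{(c)} + \eta_n^{(c)} \bar{d}_n^{(c)} \Big) + \sum_{n \in \mathcal{N}} \gamma_n \bar{P}_n,$$

where $\beta_l$, $\nu_n^{(c)}$, $\eta_n^{(c)}$ and $\gamma_n$ can be Lagrange multipliers associated with the constraints or weights
in the objective function. Hence, these problems can be solved by a common optimization
framework. ■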



D. Uplink OFDMA Systems 



In this part, we illustrate the general network model in Section II-A with a simple example
of one-hop uplink OFDMA systems. This example topology will also be used to illustrate
delay-aware resource control in later sections.

In the uplink OFDMA system example illustrated in Fig. 3, we assume the set of nodes $\mathcal{N}$
are mobile stations (MSs) that communicate with one base station (BS). Each MS and the BS are
equipped with a single antenna. Therefore, the set of links $\mathcal{L}$ corresponds to the set of all uplink
channels from the $N$ MSs to the BS (with $L = N$). Furthermore, there are $N$ data flows (with
$C = N$), and for notational simplicity, we use $l \in \{1, 2, \ldots, N (= L)\}$ to denote the link index,
the node index as well as the commodity index. We consider communications over a wideband
frequency selective fading channel, and the whole spectrum is divided into $N_F$ orthogonal flat
fading frequency bands (subcarriers). Let $H_{l,m}(t)$ denote the CSI of the $l$-th uplink on the $m$-th
($m \in \{1, 2, \ldots, N_F\}$) subcarrier and let $\mathbf{H}_l(t) = [H_{l,1}(t), H_{l,2}(t), \ldots, H_{l,N_F}(t)]$ denote the aggregate
CSI over all subcarriers of the $l$-th link. The system CSI $\mathbf{H}(t) = [\mathbf{H}_1(t), \mathbf{H}_2(t), \ldots, \mathbf{H}_N(t)]$
is a Markov process satisfying Assumption 1. In addition, we assume $\{H_{l,m}(t)\}$ are i.i.d. w.r.t.
$l \in \{1, 2, \ldots, L\}$ and $m \in \{1, 2, \ldots, N_F\}$. Let $Q_l(t)$ denote the queue length of the $l$-th link and
let $s_{l,m}(t) \in \{0, 1\}$ denote the subcarrier allocation for the $l$-th link on the $m$-th subcarrier at slot $t$.
The received signal from the $l$-th user on the $m$-th subcarrier of the BS at slot $t$ is given by

$$Y_{l,m}(t) = s_{l,m}(t) \big( H_{l,m}(t) X_{l,m}(t) + Z_{l,m}(t) \big), \quad l \in \mathcal{L},\ m \in \{1, 2, \ldots, N_F\},$$

where $X_{l,m}(t)$ is the transmit symbol and $Z_{l,m}(t) \sim \mathcal{CN}(0, 1)$ is the channel noise of the $l$-th
link on the $m$-th subcarrier at slot $t$. Hence, the data rate of the $l$-th link on the $m$-th subcarrier
at slot $t$ is given by

$$R_{l,m}(t) = s_{l,m}(t) \log_2 \big( 1 + p_{l,m}(t) |H_{l,m}(t)|^2 \big), \quad l \in \mathcal{L},\ m \in \{1, 2, \ldots, N_F\},$$

where $p_{l,m}(t)$ is the transmit power over the $l$-th link on the $m$-th subcarrier at slot $t$. The sum
rate of the $l$-th link at slot $t$ is given by $R_l(t) = \sum_{m=1}^{N_F} R_{l,m}(t)$.
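
The per-subcarrier rate and the sum rate of one link can be evaluated directly, as in the Python sketch below. It assumes unit noise power, consistent with the noise model above; the function names `subcarrier_rate` and `link_sum_rate` are hypothetical.

```python
import math
from typing import Sequence


def subcarrier_rate(allocated: int, power: float, channel: complex) -> float:
    """Rate (bits per channel use) on one subcarrier with unit noise power.

    allocated : 1 if the subcarrier is assigned to this link, 0 otherwise
    power     : transmit power on the subcarrier
    channel   : complex channel coefficient on the subcarrier
    """
    if allocated not in (0, 1) or power < 0.0:
        raise ValueError("allocated must be 0 or 1 and power non-negative")
    return allocated * math.log2(1.0 + power * abs(channel) ** 2)


def link_sum_rate(allocation: Sequence[int], powers: Sequence[float],
                  channels: Sequence[complex]) -> float:
    """Sum of the per-subcarrier rates of one link."""
    if not (len(allocation) == len(powers) == len(channels)):
        raise ValueError("inputs must have the same length")
    return sum(subcarrier_rate(s, p, h)
               for s, p, h in zip(allocation, powers, channels))


if __name__ == "__main__":
    # Subcarrier 1 is not allocated, so its power does not contribute any rate.
    print(link_sum_rate([1, 0, 1], [3.0, 5.0, 1.0], [1.0, 2.0, 1j]))
```

Only allocated subcarriers contribute, so power spent on an unallocated subcarrier is wasted; this coupling between the two decisions is what the policies defined next have to resolve.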

In this uplink OFDMA system example, the control policy for the $l$-th link is given by $\Omega_l =
(\Omega_{l,p}, \Omega_{l,s})$, where the power allocation policy $\Omega_{l,p}$ and the subcarrier allocation policy^5 $\Omega_{l,s}$ are
defined as follows.

Definition 1 (Power Allocation Policy): The power allocation policy of the $l$-th link is a mapping
$\mathcal{X} \to \mathcal{P}_l$ from the system state to the power allocation action, which is given by

$$\Omega_{l,p}(\chi) = \{ p_{l,m} \ge 0 : m \in \{1, 2, \ldots, N_F\} \} \in \mathcal{P}_l, \quad \forall l \in \mathcal{L}, \qquad (7)$$

where $p_{l,m}$ is the transmit power on the $m$-th subcarrier of the $l$-th link. ■

Definition 2 (Subcarrier Allocation Policy): The subcarrier allocation policy of the $l$-th link
is a mapping $\mathcal{X} \to \mathcal{S}_l$ from the system state to the subcarrier allocation action, which is given
by

$$\Omega_{l,s}(\chi) = \{ s_{l,m} \in \{0, 1\} : m \in \{1, 2, \ldots, N_F\} \} \in \mathcal{S}_l, \quad \forall l \in \mathcal{L}, \qquad (8)$$

where $s_{l,m} = 1$ means that the $m$-th subcarrier is used by the $l$-th link for data transmission,
and $s_{l,m} = 0$ otherwise. ■

^5 Please note that when $N_F = 1$, i.e., there is only one carrier, the subcarrier allocation is reduced to link selection. Therefore,
the subcarrier allocation policy considered in the following problem formulation covers most of the cases in resource allocations
for single-hop wireless networks.

III. Equivalent Rate Constraint Approach 

The first attempt in the literature to deal with the complicated delay control problem is to
consider an equivalent problem in the PHY layer domain only, i.e., converting average delay
constraints into average rate constraints using the large deviation theory [7], [8], [10]-[12]. This
approach can be traced back to the early 1990s, when statistical quality of service (QoS)
requirements were extensively studied in the context of effective bandwidth theory [35]-[38],
which asymptotically models the statistical behavior of a source traffic process in wired
networks (e.g., asynchronous transfer mode (ATM) and Internet protocol (IP) networks).

For notational simplicity, we consider a single queue in the following introduction of the known
results on the large deviation theory. Let $A(t)$ represent the amount of source data (in number
of bits) over the time interval $[0, t)$. Assume that the Gärtner-Ellis limit of $A(t)$, expressed as
$\Lambda_B(\theta) = \lim_{t \to \infty} \frac{1}{t} \log \mathbb{E}[e^{\theta A(t)}]$, exists for all $\theta > 0$. Then, the effective bandwidth function of $A(t)$
is defined as

$$E_B(\theta) = \frac{\Lambda_B(\theta)}{\theta} = \lim_{t \to \infty} \frac{1}{\theta t} \log \mathbb{E}\big[e^{\theta A(t)}\big]. \qquad (9)$$
Consider a queue with infinite buffer size served by a channel with constant service rate $R$. By
the large deviation theory [39], it is shown in [35] that the probability of the delay $D(t)$ at time
$t$ exceeding a delay bound $D_{\max}$ satisfies:

$$\sup_t \Pr[D(t) > D_{\max}] \approx \gamma(R) e^{-\theta(R) D_{\max}}, \qquad (10)$$

where $\gamma(R) = \Pr[D(t) > 0]$ is the probability that the buffer is nonempty and $\theta(R) = R E_B^{-1}(R)$
is the QoS exponent (i.e., the solution of $E_B(\theta) = R$ multiplied by $R$). Both $\gamma(R)$ and $\theta(R)$
are functions of the constant channel capacity $R$. Thus, a source, which has a common delay
bound $D_{\max}$ and can tolerate a delay bound violation probability of at most $\epsilon$, can be modeled by
the pair $\{\gamma(R), \theta(R)\}$, where the constant channel capacity should be at least $R$ with $R$ being
the solution of $\gamma(R) e^{-\theta(R) D_{\max}} = \epsilon$. The intuitive explanation is that the tail probability that
the delay $D(t)$ exceeds $D_{\max}$ is proportional to the probability that the buffer is nonempty and
decays exponentially fast as the threshold $D_{\max}$ increases. The QoS exponent $\theta(R)$ [7] can be
interpreted as the indicator of the QoS requirement, i.e., a smaller $\theta(R)$ corresponds to a looser
QoS requirement and vice versa. As a result, the effective bandwidth is defined as the minimum
service rate required by a given arrival process for which the QoS exponent requirement is
satisfied.
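
As a small numerical illustration of the effective bandwidth, the Python sketch below evaluates it for a source whose per-slot arrivals are i.i.d. with a finite discrete distribution. Under this i.i.d. assumption (ours, for illustration) the limit in the definition reduces to the log moment generating function of a single slot divided by $\theta$. The function name `effective_bandwidth` is hypothetical.

```python
import math
from typing import Sequence


def effective_bandwidth(theta: float, values: Sequence[float],
                        probs: Sequence[float]) -> float:
    """Effective bandwidth of an i.i.d. per-slot arrival process.

    values/probs describe the distribution of the arrivals in one slot.
    Assumption: arrivals are i.i.d. across slots, so the Gartner-Ellis limit
    equals the one-slot log moment generating function.
    """
    if theta <= 0.0:
        raise ValueError("theta must be positive")
    if len(values) != len(probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must match values and sum to one")
    mgf = sum(p * math.exp(theta * a) for a, p in zip(values, probs))
    return math.log(mgf) / theta


if __name__ == "__main__":
    # On-off source: mean 2 bits/slot, peak 4 bits/slot.
    for theta in (0.01, 0.1, 1.0, 5.0):
        print(theta, effective_bandwidth(theta, [0.0, 4.0], [0.5, 0.5]))
```

For the on-off source the value moves from near the mean rate (small $\theta$, loose QoS) toward the peak rate (large $\theta$, stringent QoS), matching the interpretation of the QoS exponent given above.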

Inspired by the effective bandwidth theory, where the constant service rate $R$ is used in the
source traffic modeling in wired networks, the authors in [7] use the constant source traffic rate
$\lambda$ to model a wireless communication channel. They propose the effective capacity, which is
the dual of the effective bandwidth. Let $S(t) = \sum_{\tau=1}^{t} R(\tau)$ represent the amount of service (in
number of bits) over the time interval $[0, t)$. Assume that the Gärtner-Ellis limit of $S(t)$, expressed
as $\Lambda_C(-\theta) = \lim_{t \to \infty} \frac{1}{t} \log \mathbb{E}[e^{-\theta S(t)}]$, exists for all $\theta > 0$. Then, the effective capacity function of $S(t)$
is defined as

$$E_C(\theta) = -\frac{\Lambda_C(-\theta)}{\theta} = -\lim_{t \to \infty} \frac{1}{\theta t} \log \mathbb{E}\big[e^{-\theta S(t)}\big]. \qquad (11)$$

If we further assume the process $\{R(t)\}$ is uncorrelated, then the effective capacity reduces to

$$E_C(\theta) = -\frac{1}{\theta} \log \Big( \mathbb{E}\big[e^{-\theta R(t)}\big] \Big). \qquad (12)$$
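
The single-slot expression for the effective capacity is easy to evaluate numerically. The Python sketch below does so for a service process whose per-slot rate takes finitely many values independently across slots; the discrete rate distribution and the name `effective_capacity` are assumptions for illustration.

```python
import math
from typing import Sequence


def effective_capacity(theta: float, rates: Sequence[float],
                       probs: Sequence[float]) -> float:
    """Effective capacity -(1/theta) * log E[exp(-theta * R)] for one slot.

    rates/probs describe the per-slot service rate distribution, assumed
    independent across slots.
    """
    if theta <= 0.0:
        raise ValueError("theta must be positive")
    if len(rates) != len(probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must match rates and sum to one")
    mgf = sum(p * math.exp(-theta * r) for r, p in zip(rates, probs))
    return -math.log(mgf) / theta


if __name__ == "__main__":
    # Two-state channel: 1 or 5 bits/slot with equal probability (mean 3).
    for theta in (0.01, 0.1, 1.0, 5.0):
        print(theta, effective_capacity(theta, [1.0, 5.0], [0.5, 0.5]))
```

The value decreases from near the mean service rate toward the worst-case rate as $\theta$ grows, which mirrors the behavior of the effective bandwidth on the arrival side.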

Consider a queue of infinite buffer size fed by a data source of constant data rate $\lambda$ (in
number of bits). Similar to the effective bandwidth case, it is shown in [7] that the probability
of the delay $D(t)$ at time $t$ exceeding a delay bound $D_{\max}$ satisfies:

$$\sup_t \Pr[D(t) > D_{\max}] \approx \gamma(\lambda) e^{-\theta(\lambda) D_{\max}}, \qquad (13)$$

where $\gamma(\lambda) = \Pr[D(t) > 0]$ is the probability that the buffer is nonempty and the QoS exponent
is $\theta(\lambda) = \lambda E_C^{-1}(\lambda)$. Both $\gamma(\lambda)$ and $\theta(\lambda)$ are functions of the constant source rate $\lambda$. Thus,
a source, which has a common delay bound $D_{\max}$ and can tolerate a delay bound violation
probability of at most $\epsilon$, can be modeled by the pair $\{\gamma(\lambda), \theta(\lambda)\}$, where the constant data rate
should be at most $\lambda$ with $\lambda$ being the solution of $\gamma(\lambda) e^{-\theta(\lambda) D_{\max}} = \epsilon$. Therefore, as the dual of
the effective bandwidth, the effective capacity is defined as the maximum constant arrival rate
that a given service process can support in order to guarantee a QoS requirement specified by
$\epsilon$.
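
The exponential tail approximation can be inverted to find the QoS exponent that meets a given delay target. The Python sketch below solves the equation "nonempty probability times exp(-exponent times delay bound) equals violation probability" for the exponent, treating the nonempty-buffer probability as a known constant; that simplification and the name `required_qos_exponent` are our assumptions.

```python
import math


def required_qos_exponent(nonempty_prob: float, violation_prob: float,
                          delay_bound: float) -> float:
    """Smallest exponent theta with nonempty_prob * exp(-theta * delay_bound)
    no larger than violation_prob.

    Assumption: nonempty_prob is treated as fixed, although in the text it
    depends on the rate. Returns 0 if the target already holds for theta = 0.
    """
    if not (0.0 < nonempty_prob <= 1.0 and 0.0 < violation_prob <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    if delay_bound <= 0.0:
        raise ValueError("delay_bound must be positive")
    return max(math.log(nonempty_prob / violation_prob) / delay_bound, 0.0)


if __name__ == "__main__":
    # A tighter delay bound or a smaller violation probability needs a larger exponent.
    print(required_qos_exponent(0.8, 1e-3, 10.0))
    print(required_qos_exponent(0.8, 1e-3, 5.0))
```

Halving the delay bound doubles the required exponent, which is the sense in which a larger exponent corresponds to a more stringent QoS requirement.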

With the above observation, we can incorporate the QoS requirement into a pure PHY layer
requirement. By interpreting $\theta$ as the QoS constraint, the throughput maximization problem
subject to the delay QoS constraint (in terms of the exponential tail probability of the queue
length distribution) can be directly transformed into an effective capacity maximization problem
for a given QoS exponent $\theta$. This approach is widely used when the QoS constraint is specified
in terms of the QoS exponent $\theta$. Interested readers can refer to [9]-[11] and references therein
for more detailed descriptions.

A more careful treatment of the average delay constraint is developed in [8], where a packet
flow model is considered. The principal idea behind this approach is to establish the relationship
among the average delay requirement $D$, the average arrival rate $\lambda$ and the average service rate
$\mu$ using the queueing theory framework [34]. The following procedures are performed to obtain
this relationship.

1) Express the average system delay in terms of the average residual service time and the
average queueing delay.

2) Express the average residual service time in terms of the moments of the service process.

3) Establish the relationship among the average delay requirement, the average arrival rate
and the average service rate.

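Example 1 (Equivalent Rate Constraint Approach for Uplink OFDMA Systems): In the uplink
OFDMA system example, we assume the buffer size for each link is infinite (as in [8]). Hence,
the optimization problem (6) can be simplified as follows [8]:

$$\min_{\Omega} \ \mathbb{E}\Big[ \sum_{l=1}^{L} \sum_{m=1}^{N_F} p_{l,m} \Big] \qquad (14)$$
$$\text{s.t.} \quad s_{l,m} \in \{0, 1\},\ \forall l \in \mathcal{L},\ m \in \{1, 2, \ldots, N_F\}, \quad \sum_{l=1}^{L} s_{l,m} = 1,\ \forall m \in \{1, 2, \ldots, N_F\} \qquad (15)$$
$$\mathbb{E}\Big[ \sum_{m=1}^{N_F} p_{l,m} \Big] \le P_l, \quad \forall l \in \mathcal{L} \qquad (16)$$
$$\frac{\mathbb{E}[Q_l(t)]}{\lambda_l} \le D_l, \quad \forall l \in \mathcal{L}. \qquad (17)$$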
Following the standard procedures shown above, the average delay constraint (17) can be replaced
with an equivalent average rate constraint given by [8, Lemma 1]

$$\mathbb{E}\Big[ \sum_{m=1}^{N_F} s_{l,m} \log_2\big(1 + p_{l,m} |H_{l,m}|^2\big) \Big] \ge \frac{(2 D_l \lambda_l + 2) + \sqrt{(2 D_l \lambda_l + 2)^2 - 8 D_l \lambda_l}}{4 D_l} \bar{N}_l, \qquad (18)$$

where $\bar{N}_l$ and $\lambda_l$ are the average packet size and the average arrival rate of the $l$-th link, respectively.
Applying the above results, the original optimization problem (14) can be reformulated
as follows:

$$\min_{\Omega} \ \mathbb{E}\Big[ \sum_{l=1}^{L} \sum_{m=1}^{N_F} p_{l,m} \Big] \qquad (19)$$
$$\text{s.t.} \quad (15),\ (16),\ (18). \qquad (20)$$
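
To see how the delay requirement translates into a rate requirement, the Python sketch below evaluates the right-hand side of the equivalent rate constraint, read as $\frac{(2D\lambda + 2) + \sqrt{(2D\lambda + 2)^2 - 8D\lambda}}{4D}\bar{N}$ for one link. This reading of the formula and the function name `required_average_rate` are our assumptions for illustration.

```python
import math


def required_average_rate(delay_bound: float, arrival_rate: float,
                          avg_packet_size: float) -> float:
    """Minimum average rate implied by an average delay bound for one link.

    delay_bound     : average delay requirement D (time units)
    arrival_rate    : average packet arrival rate lambda (packets per time unit)
    avg_packet_size : average packet size (bits per packet)
    """
    if delay_bound <= 0.0 or arrival_rate <= 0.0 or avg_packet_size <= 0.0:
        raise ValueError("all arguments must be positive")
    x = 2.0 * delay_bound * arrival_rate + 2.0
    # x*x - 8*D*lambda = 4*(D*lambda)^2 + 4 > 0, so the square root is real.
    root = math.sqrt(x * x - 8.0 * delay_bound * arrival_rate)
    return (x + root) / (4.0 * delay_bound) * avg_packet_size


if __name__ == "__main__":
    # The requirement approaches the offered load (200 bits per time unit)
    # as the delay bound is relaxed.
    for d in (0.1, 1.0, 10.0, 1000.0):
        print(d, required_average_rate(d, arrival_rate=2.0, avg_packet_size=100.0))
```

For a very loose delay bound the required rate tends to the offered load, while a tight bound demands considerable extra service rate; in both cases the requirement depends only on traffic statistics and not on the instantaneous queue length.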



Optimization problem (19)-(20) is a mixed combinatorial (w.r.t. the integer variables $\{s_{l,m}\}$) and
convex optimization problem (w.r.t. $\{p_{l,m}\}$). If we relax the integer constraint $s_{l,m} \in \{0, 1\}$
into real values, i.e., $s_{l,m} \in [0, 1]$, the resultant problem (19) would be a convex optimization
problem. Using standard Lagrange multiplier techniques, we can derive the optimal subcarrier
and power allocation as follows:

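$$s_{l,m} = \begin{cases} 1, & \text{if } X_{l,m} = \max_j \{X_{j,m}\} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (21)$$

$$p_{l,m} = s_{l,m} \left( \frac{1 + \nu_l}{\gamma_l} - \frac{1}{|H_{l,m}|^2} \right)^+, \qquad (22)$$

where

$$X_{l,m} = (1 + \nu_l) \log_2 \left( 1 + |H_{l,m}|^2 \left( \frac{1 + \nu_l}{\gamma_l} - \frac{1}{|H_{l,m}|^2} \right)^+ \right) - \gamma_l \left( \frac{1 + \nu_l}{\gamma_l} - \frac{1}{|H_{l,m}|^2} \right)^+,$$

$\gamma_l$ is the Lagrange multiplier corresponding to the average power constraint in (16) and $\nu_l$ is the
Lagrange multiplier corresponding to the transformed average rate constraint (18) for the $l$-th
MS. ■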
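
The structure of this solution (water-filling in power, winner-takes-all in subcarriers) can be sketched in Python as follows. Each link is given a rate weight and a power price, standing in for the Lagrange multipliers, which are treated as already known; the water level is obtained by maximizing "weight times log2(1 + gain times power) minus price times power", so its constants may differ from those in the example. All names, including `allocate_subcarriers`, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def allocate_subcarriers(gains: Sequence[Sequence[float]],
                         rate_weights: Sequence[float],
                         power_prices: Sequence[float]
                         ) -> Tuple[List[int], List[List[float]]]:
    """Winner-takes-all subcarrier assignment with water-filling powers.

    gains[l][m]  : channel power gain |H|^2 (> 0) of link l on subcarrier m
    rate_weights : per-link weight on the rate (assumed given)
    power_prices : per-link price of power (assumed given, > 0)
    Returns (assignment, powers); assignment[m] is the chosen link or -1 if
    no link has a positive metric on subcarrier m.
    """
    num_links, num_sub = len(gains), len(gains[0])
    assignment = [-1] * num_sub
    powers = [[0.0] * num_sub for _ in range(num_links)]
    for m in range(num_sub):
        best_metric, best_power = 0.0, 0.0
        for l in range(num_links):
            # Water level that maximizes w*log2(1 + g*p) - c*p over p >= 0.
            level = rate_weights[l] / (power_prices[l] * math.log(2.0))
            p = max(level - 1.0 / gains[l][m], 0.0)
            metric = (rate_weights[l] * math.log2(1.0 + gains[l][m] * p)
                      - power_prices[l] * p)
            if metric > best_metric:
                best_metric, best_power, assignment[m] = metric, p, l
        if assignment[m] >= 0:
            powers[assignment[m]][m] = best_power
    return assignment, powers


if __name__ == "__main__":
    gains = [[4.0, 0.1], [1.0, 2.0]]
    print(allocate_subcarriers(gains, rate_weights=[1.0, 1.0],
                               power_prices=[1.0, 1.0]))
```

The allocation depends only on the channel gains and on the fixed per-link weights; the instantaneous queue lengths never enter, which is the CSI-only behavior discussed next.
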
This approach provides potentially simple solutions for single-hop wireless networks in the
sense that the cross-layer optimization problem is transformed into a purely information theoretical
optimization problem. Then, the traditional PHY layer optimization approach, such as power
allocation and subcarrier allocation, can be readily applied to solve the transformed problem.
The optimal control policy is a function of the CSI with some weighting shifts by the delay
requirements and hence it is simple to implement in practical communication systems. However,
the implicit assumption behind this approach is that the user traffic loading is quite high, or
equivalently, the probability that a buffer is empty is quite low (i.e., the large delay regime). For
the general delay regime, the delay-optimal control policy should be adaptive to both the CSI and
the QSI and the performance of the "equivalent rate constraint" approach is not promising, as
we shall illustrate in Section VII. In addition, due to the complex coupling among the queues
in multi-hop wireless networks, it is difficult to express delay constraints in terms of all the
control actions, including routing. Therefore, this approach cannot be easily extended to multi-hop
wireless networks.

IV. Stochastic Lyapunov Stability Drift Approach

Another important method to deal with delay-aware resource control in wireless networks is
to directly analyze the characteristics of the control policies in the stochastic stability sense using
the Lyapunov drift technique. The Lyapunov drift theory has a long history in the field of discrete
stochastic processes and Markov chains [40], [41]. The authors of [42] first applied the Lyapunov
drift theory to develop a general algorithm which stabilizes a multi-hop packet radio network
with configurable link activation sets. The concepts of maximum weight matching (MWM) and
differential backlog scheduling, developed in [42], play important roles in the dynamic control
strategies in queueing networks. The Lyapunov drift theory was later extended to the Lyapunov
optimization theory. In this section, we first introduce the preliminaries and the main results on
the Lyapunov stability analysis; after that, we present two examples, one for the Lyapunov drift
theory and the other for the Lyapunov optimization theory.

A. What is Queue Stability? 

First, we introduce the definition of queue stability as follows.
Definition 3 (Queue Stability):

1) A single queue is strongly stable if $\limsup_{T\to+\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[Q(t)]<\infty$.

2) A network of queues is strongly stable if all the individual queues of the network are
strongly stable.

■
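
Definition 3 can be explored empirically with a short simulation. The sketch below is only illustrative: the Bernoulli arrival and service model, the update order $Q(t+1)=\max(Q(t)-\mu(t),0)+A(t)$ and the function name `time_average_backlog` are assumptions made here, and a finite-horizon sample average only suggests, rather than proves, strong stability.

```python
import random


def time_average_backlog(arrival_prob, service_prob, slots=20000, seed=0):
    """Sample time-average backlog (1/T) * sum_t Q(t) of a single queue.

    Assumed toy model: one packet arrives per slot with probability
    arrival_prob, and one packet can be served with probability service_prob.
    """
    rng = random.Random(seed)
    queue = 0
    total = 0
    for _ in range(slots):
        service = 1 if rng.random() < service_prob else 0
        arrival = 1 if rng.random() < arrival_prob else 0
        queue = max(queue - service, 0) + arrival
        total += queue
    return total / slots


if __name__ == "__main__":
    print("arrival rate below service rate:", time_average_backlog(0.3, 0.6))
    print("arrival rate above service rate:", time_average_backlog(0.7, 0.6))
```

When the arrival rate is below the service rate the sample average settles at a small value; when it exceeds the service rate the average keeps growing with the horizon, which is the behaviour that strong stability rules out.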

A queue is strongly stable if it has a bounded time average backlog. Throughout this paper, we 
use the term "stability" to refer to strong stability. Based on the definition of stability, we define 
the stability region as follows. 

Definition 4 (Stability Region): The stability region $\Lambda_\Omega$ of a policy $\Omega$ is the set of average
arrival rate vectors $\{\lambda_n^{(c)} \mid n\in\mathcal{N}, c\in\mathcal{C}\}$ for which the system is stable under $\Omega$. The stability
region of the system $\Lambda$ is the closure of the set of all average arrival rate vectors $\{\lambda_n^{(c)} \mid n\in\mathcal{N}, c\in\mathcal{C}\}$
for which a stabilizing control policy exists. Mathematically, we have

    $\Lambda=\bigcup_{\Omega\in\mathcal{G}}\Lambda_\Omega,$    (23)

where $\mathcal{G}$ denotes the set of all stabilizing feasible control policies. ■
Definition 5 (Throughput-Optimal Policy): A throughput-optimal policy dominates* any other
policy in $\mathcal{G}$, i.e., it has a stability region that is a superset of the stability region of any other policy
in $\mathcal{G}$. Therefore, it should have a stability region equal to $\Lambda$. ■
In other words, throughput-optimal policies ensure that the queueing system is stable as long
as the average arrival rate vector is within the system stability region. Three classes of policies
that are known to be throughput-optimal are the Max Weight rule (also known as M-LWDF/M-LWWF
[16] in single-hop wireless queueing systems), the Exponential (EXP) rule [14] and the
Log rule [15]. The throughput-optimal property of Max Weight type algorithms is proved by
the theory of Lyapunov drift [16], which is introduced in the next part. The Log rule is also
proved to be throughput-optimal by a theorem related to the Lyapunov drift in [15]. On the
other hand, the EXP rule is proved to be throughput-optimal by the fluid limit technique along
with a separation of time scales argument in [14].



B. Main Results on Lyapunov Drift 

In order to show the stability property of the queueing systems, we rely on the well-developed 



stability theory in Markov Chains using negative Lyapunov drift p()[ , pT] |, p3|-p3|. We use the 
quadratic Lyapunov function $L(\mathbf{Q})=\sum_{n\in\mathcal{N},c\in\mathcal{C}}\big(Q_n^{(c)}\big)^2$ of the system queue state $\mathbf{Q}$ throughout
the rest of the paper. Based on the Lyapunov function, we define the (one-slot) Lyapunov drift
as the expected change in the Lyapunov function from one slot to the next, which is given by

    $\Delta(\mathbf{Q}(t))\triangleq\mathbb{E}\big[L(\mathbf{Q}(t+1))-L(\mathbf{Q}(t))\mid\mathbf{Q}(t)\big].$    (24)



The theory of Lyapunov stability is summarized as follows [13], [46]:

Theorem 1 (Lyapunov Drift): If there are positive values $B$, $\epsilon$ such that for all time slots t we
have

    $\Delta(\mathbf{Q}(t))\le B-\epsilon\sum_{n\in\mathcal{N},c\in\mathcal{C}}Q_n^{(c)}(t),$    (25)

then the network is stable, and the average queue length satisfies

    $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\sum_{n\in\mathcal{N},c\in\mathcal{C}}\mathbb{E}\big[Q_n^{(c)}(t)\big]\le\frac{B}{\epsilon}.$

*A policy $\Omega_1$ dominates another policy $\Omega_2$ if $\Lambda_{\Omega_2}\subseteq\Lambda_{\Omega_1}$.

Note that if the condition in (25) holds, then the Lyapunov drift satisfies $\Delta(\mathbf{Q}(t))\le-\delta$ ($\forall\delta>0$)
whenever $\sum_{n\in\mathcal{N},c\in\mathcal{C}}Q_n^{(c)}(t)\ge\frac{B+\delta}{\epsilon}$. Intuitively, this property ensures network stability because
whenever the queue length vector leaves the bounded region for the sum queue length, i.e.,
$\{\mathbf{Q}\succeq 0:\sum_{n\in\mathcal{N},c\in\mathcal{C}}Q_n^{(c)}(t)\le\frac{B+\delta}{\epsilon}\}$, the negative drift $\Delta(\mathbf{Q}(t))\le-\delta$ eventually drives it back
to this region.

Next, we illustrate how to use the Lyapunov drift to prove the stability of queueing networks
and develop stabilizing control algorithms. Define the maximum input rate and output rate of
node n as $\mu_{\max,n}^{\mathrm{in}}=\sup_t\sum_{c\in\mathcal{C}}\sum_{l\in\{l:d(l)=n\}}\mu_l^{(c)}(t)$ and $\mu_{\max,n}^{\mathrm{out}}=\sup_t\sum_{c\in\mathcal{C}}\sum_{l\in\{l:s(l)=n\}}\mu_l^{(c)}(t)$,
respectively. They are finite due to the resource allocation constraints. Assume the total exogenous
arrival to node n is bounded by a constant $A_n^{\max}=\sup_t\sum_{c\in\mathcal{C}}A_n^{(c)}(t)$. From the queue dynamics
in (1), we have the following bound for the Lyapunov drift [13]:

    $\Delta(\mathbf{Q}(t))\le B+2\sum_{n\in\mathcal{N},c\in\mathcal{C}}Q_n^{(c)}(t)\lambda_n^{(c)}-2\,\mathbb{E}\Big[\sum_{n\in\mathcal{N},c\in\mathcal{C}}Q_n^{(c)}(t)\Big(\sum_{l\in\{l:s(l)=n\}}\mu_l^{(c)}(t)-\sum_{l\in\{l:d(l)=n\}}\mu_l^{(c)}(t)\Big)\,\Big|\,\mathbf{Q}(t)\Big],$    (26)

where

    $B=\sum_{n\in\mathcal{N}}\Big(\big(\mu_{\max,n}^{\mathrm{out}}\big)^2+\big(A_n^{\max}+\mu_{\max,n}^{\mathrm{in}}\big)^2\Big).$    (27)

The dynamic backpressure algorithm (DBP) is designed to minimize the upper bound of the
Lyapunov drift (the R.H.S. of (26)) over all policies at each time slot. For single-hop wireless
networks, we use the link index l to specify each queue instead of the node index n and the
commodity index c for notational simplicity. From (26), the single-hop dynamic backpressure
algorithm (M-LWDF/M-LWWF) maximizes $\sum_{l\in\mathcal{L}}Q_l(t)\mu_l(t)$ under resource allocation constraints.
Based on Theorem 1, it is shown that the DBP algorithm is throughput-optimal [25], [26].
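
The single-hop rule above (serve the link with the largest $Q_l(t)\mu_l(t)$) can be sketched in a few lines. This is a toy illustration rather than the algorithm of [42]: the assumption that exactly one link is served per slot, the random rate set {0, 1, 2}, the Bernoulli arrivals and the names `max_weight_schedule` and `simulate` are all introduced here for the example.

```python
import random


def max_weight_schedule(queues, rates):
    """Return the index of the link maximising Q_l * mu_l (ties: lowest index)."""
    weights = [q * r for q, r in zip(queues, rates)]
    return max(range(len(weights)), key=weights.__getitem__)


def simulate(arrival_probs, slots=5000, seed=0):
    """Average total backlog under max-weight scheduling in a toy model.

    Assumptions: one link is served per slot, each link's offered rate is
    drawn uniformly from {0, 1, 2} packets/slot, arrivals are Bernoulli.
    """
    rng = random.Random(seed)
    queues = [0] * len(arrival_probs)
    backlog_sum = 0
    for _ in range(slots):
        rates = [rng.choice((0, 1, 2)) for _ in queues]  # per-slot channel state
        served = max_weight_schedule(queues, rates)
        queues[served] = max(queues[served] - rates[served], 0)
        for link, prob in enumerate(arrival_probs):
            if rng.random() < prob:
                queues[link] += 1
        backlog_sum += sum(queues)
    return backlog_sum / slots


if __name__ == "__main__":
    print(max_weight_schedule([3, 1], [1, 2]))
    print(simulate([0.2, 0.2]), simulate([0.9, 0.9]))
```

The decision uses both the QSI (queue lengths) and the CSI (offered rates), which is the key difference from the CSI-only policy of Example 1.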
After the introduction of dynamic control algorithm designs in [42], the Lyapunov drift
approach has been successfully used to optimize the allocation of computer resources [17] and to stabilize
packet switch systems [18]-[21] as well as satellite and wireless systems [22]-[24]. For example, the
concepts of MWM and differential backlog scheduling were first developed in [42] based on the
Lyapunov drift theory. Using a linear programming argument and the Lyapunov drift theory,
it is proved in [21] that an MWM algorithm can achieve a throughput of 100% for both uniform and
nonuniform arrivals. Based on the analytical techniques of the Lyapunov drift, bounds
on the average delay and on queue size averages as well as variances in input-queued cell-based
switches under MWM are derived in [20]. By the Lyapunov drift theory, the Longest Connected
Queue (LCQ) algorithm is proved in [22] to stabilize the system under certain conditions and to
minimize the delay for the special case of symmetric queues (i.e., queues with equal arrival,
service and connectivity statistics). Due to space limitations, we refer readers to the above
references for the details.

C. Main Results on Lyapunov Optimization

The Lyapunov drift theory is extended to the Lyapunov optimization theory, through which
we can stabilize queueing networks while additionally optimizing some performance metrics and
satisfying additional constraints [13].



Let $\mathbf{x}(t)=(x_1(t),x_2(t),\dots,x_M(t))$ represent any associated vector control process that
influences the dynamics of the vector queue length process $\mathbf{Q}(t)$. Let $g:\mathbb{R}^M\to\mathbb{R}$ be any scalar
valued concave function. Define $\bar{\mathbf{x}}(T)=\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[\mathbf{x}(t)]$ and $\bar{g}=\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[g(\mathbf{x}(t))]$.
Suppose the goal is to stabilize the $\mathbf{Q}(t)$ process while maximizing $g(\cdot)$ of the time average of
the $\mathbf{x}(t)$ process, i.e., maximizing $g(\bar{\mathbf{x}})$, where $\bar{\mathbf{x}}=\limsup_{T\to\infty}\bar{\mathbf{x}}(T)$. Let $g^*$ represent a desired
"target" utility value. The theory of Lyapunov optimization is summarized as follows:

Theorem 2 (Lyapunov Optimization): If there are positive constants $V$, $\epsilon$, $B$ such that for all t
and all $\mathbf{Q}(t)$, the Lyapunov drift satisfies

    $\Delta(\mathbf{Q}(t))-V\,\mathbb{E}\big[g(\mathbf{x}(t))\mid\mathbf{Q}(t)\big]\le B-\epsilon\sum_{n\in\mathcal{N},c\in\mathcal{C}}Q_n^{(c)}(t)-Vg^*,$    (28)

then we have

    $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\sum_{n\in\mathcal{N},c\in\mathcal{C}}\mathbb{E}\big[Q_n^{(c)}(t)\big]\le\frac{B+V(\bar{g}-g^*)}{\epsilon},$    (29)

    $\liminf_{T\to\infty}g\big(\bar{\mathbf{x}}(T)\big)\ge g^*-\frac{B}{V}.$    (30)

Note that a similar result can be shown for minimizing a convex function $h:\mathbb{R}^M\to\mathbb{R}$ by
defining $g(\cdot)=-h(\cdot)$ and reversing inequalities where appropriate.
Theorem 2 is most useful when the quantity $\bar{g}-g^*$ can be bounded by a constant. Specifically,
if $\bar{g}-g^*\le G_{\max}$, then the lower bound on the achieved utility can be pushed arbitrarily
close to the target utility $g^*$ with a corresponding increase (linear in V) in the upper bound
on $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\sum_{n\in\mathcal{N},c\in\mathcal{C}}\mathbb{E}[Q_n^{(c)}(t)]$. The Lyapunov optimization theorem in Theorem 2
suggests that a good control strategy is to greedily minimize the following drift metric at every
time slot:

    $\Delta(\mathbf{Q}(t))-V\,\mathbb{E}\big[g(\mathbf{x}(t))\mid\mathbf{Q}(t)\big],$

i.e., the L.H.S. of (28).



Next, we introduce the Energy-Efficient Control Algorithm (EECA) [13], which utilizes the
Lyapunov optimization theory to develop an algorithm that stabilizes the system and consumes
an average power that is arbitrarily close to the minimum power solution. From (26), we have

    $\Delta(\mathbf{Q}(t))+V\,\mathbb{E}\Big[\sum_{n\in\mathcal{N}}P_n(t)\,\Big|\,\mathbf{Q}(t)\Big]\le B+2\sum_{n\in\mathcal{N},c\in\mathcal{C}}Q_n^{(c)}(t)\lambda_n^{(c)}-\mathbb{E}\Big[\sum_{n\in\mathcal{N},c\in\mathcal{C}}2Q_n^{(c)}(t)\Big(\sum_{l\in\{l:s(l)=n\}}\mu_l^{(c)}(t)-\sum_{l\in\{l:d(l)=n\}}\mu_l^{(c)}(t)\Big)-V\sum_{n\in\mathcal{N}}P_n(t)\,\Big|\,\mathbf{Q}(t)\Big].$    (31)

EECA is designed to minimize the R.H.S. of the inequality in (31) over all possible power
allocation strategies. For single-hop wireless networks, we use the link index l to denote the node
index as well as the commodity index. From (31), the single-hop EECA maximizes
$\sum_{l\in\mathcal{L}}\big(2Q_l(t)\mu_l(t)-VP_l(t)\big)$ over all possible power allocation strategies at each slot t. Based on
Theorem 2, it is shown that the EECA is throughput-optimal and can achieve an $[O(1/V),O(V)]$
power-delay tradeoff by adjusting the parameter V [13].
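
The per-slot decision of the single-hop EECA can be illustrated for one link. The sketch below is an assumption-laden toy: the rate model $\mu(p)=\log_2(1+|H|^2p)$, the finite set of power levels and the name `eeca_power` are chosen here for illustration and are not specified in the text.

```python
import math


def eeca_power(queue, gain, V, power_levels):
    """Pick the power level maximising 2*Q*mu(p) - V*p for one link.

    Assumed rate model: mu(p) = log2(1 + gain * p), with gain = |H|^2.
    power_levels is a hypothetical finite set of admissible powers.
    """
    def metric(p):
        return 2.0 * queue * math.log2(1.0 + gain * p) - V * p

    return max(power_levels, key=metric)


if __name__ == "__main__":
    levels = [0.0, 1.0, 2.0, 4.0]
    for V in (1.0, 10.0):
        print(V, [eeca_power(q, 1.0, V, levels) for q in (0, 2, 20)])
```

A larger V makes the rule more reluctant to spend power until the backlog has grown, which is the mechanism behind the power-delay tradeoff mentioned above.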

The Lyapunov optimization theory also has applications in transport layer flow control
and network fairness optimization when the exogenous arrival rates are outside of the network
stability region. The Cross Layer Control (CLC) algorithm is designed to greedily minimize the
R.H.S. of (28) to achieve a utility of exogenous arrival rates that is arbitrarily close to optimal
while maintaining network stability [13], [25], [26].

Remark 3: Note that the average delay bounds developed in Theorem 1 and Theorem 2 are
tight only when the traffic loading is sufficiently high, while the tightness of such delay bounds
for moderate and light traffic loading is not known. ■

D. Methodology and Example

To stabilize queueing networks using the Lyapunov drift theorem in Theorem 1, or to stabilize
queueing networks while additionally optimizing some performance metrics (e.g., maximizing
the average weighted sum system throughput, or minimizing the average weighted sum
power in (6)) using the Lyapunov optimization theorem in Theorem 2, the procedure can
be summarized as follows:

1) Choose a Lyapunov function and calculate the Lyapunov drift $\Delta(\mathbf{Q}(t))$ or $\Delta(\mathbf{Q}(t))-V\,\mathbb{E}[g(\mathbf{x}(t))\mid\mathbf{Q}(t)]$,
where $g(\mathbf{x}(t))$ is the utility to be maximized.

2) Based on the system state observations, minimize the upper bound of $\Delta(\mathbf{Q}(t))$ or
$\Delta(\mathbf{Q}(t))-V\,\mathbb{E}[g(\mathbf{x}(t))\mid\mathbf{Q}(t)]$ over all policies at each time slot.

3) Transform other average performance constraints into queue stability problems using the
technique of virtual cost queues [13], [26] if needed.†

In the following, we illustrate how to apply the Lyapunov drift approach and the Lyapunov
optimization approach to resource allocation for uplink OFDMA systems, respectively.

Example 2 (Lyapunov Drift Approach for Uplink OFDMA Systems): In the uplink OFDMA
system example, the dynamic backpressure algorithm under the subcarrier allocation constraints
in (15) and the average power constraints in (16) can be obtained by solving the following
optimization problem:

    $\max\ \sum_{l=1}^{L}Q_l(t)\sum_{m=1}^{N_F}\mu_{l,m}(t),\quad\forall t$    (32)

    s.t. (15), (16) are satisfied.    (33)

Similar to Example 1, by applying continuous relaxation (i.e., $s_{l,m}\in[0,1]$) and standard convex
optimization techniques, we can derive the optimal subcarrier and power allocation as follows:

    $s_{l,m}=1$ if $X_{l,m}=\max_j\{X_{j,m}\}>0$, and $s_{l,m}=0$ otherwise,    (34)

    $p_{l,m}=s_{l,m}\Big(\frac{Q_l(t)}{\gamma_l}-\frac{1}{|H_{l,m}|^2}\Big)^+,$    (35)

where

    $X_{l,m}=Q_l(t)\log_2\Big(1+|H_{l,m}|^2\Big(\frac{Q_l(t)}{\gamma_l}-\frac{1}{|H_{l,m}|^2}\Big)^+\Big)-\gamma_l\Big(\frac{Q_l(t)}{\gamma_l}-\frac{1}{|H_{l,m}|^2}\Big)^+$

and $\gamma_l$ is the Lagrange multiplier corresponding to the average power constraint in (16) for the
l-th MS. ■

†Please refer to [13], [26] for the details of the virtual cost queue technique, which we omit here due to space limitations.
In all the examples, we use Lagrangian techniques to deal with these average constraints; this facilitates the derivation of
closed-form solutions and allows insights and comparisons across the solutions obtained by different techniques.
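
The closed-form allocation of Example 2 can be evaluated slot by slot once the multipliers $\gamma_l$ are known. The sketch below assumes the multipliers are given (how they are tuned to meet (16) is not covered here), takes $\mu_{l,m}=s_{l,m}\log_2(1+p_{l,m}|H_{l,m}|^2)$ as the rate model, and uses the hypothetical name `backpressure_allocation`; ties between links are broken toward the lowest index, which is also an assumption.

```python
import math


def backpressure_allocation(queues, gains, gammas):
    """One-slot subcarrier and power allocation following (34)-(35).

    queues[l]   -- Q_l(t)
    gains[l][m] -- |H_{l,m}|^2
    gammas[l]   -- Lagrange multiplier gamma_l > 0 (assumed given)
    Returns (s, p) as nested lists indexed [link][subcarrier].
    """
    num_links, num_subcarriers = len(queues), len(gains[0])
    s = [[0] * num_subcarriers for _ in range(num_links)]
    p = [[0.0] * num_subcarriers for _ in range(num_links)]
    for m in range(num_subcarriers):
        best_link, best_metric, best_power = None, 0.0, 0.0
        for l in range(num_links):
            if gains[l][m] <= 0.0:
                continue  # a subcarrier with zero gain carries no rate
            power = max(queues[l] / gammas[l] - 1.0 / gains[l][m], 0.0)
            metric = (queues[l] * math.log2(1.0 + gains[l][m] * power)
                      - gammas[l] * power)
            if metric > best_metric:  # strict: the winner needs X_{l,m} > 0
                best_link, best_metric, best_power = l, metric, power
        if best_link is not None:
            s[best_link][m] = 1
            p[best_link][m] = best_power
    return s, p


if __name__ == "__main__":
    print(backpressure_allocation([4, 0], [[1.0], [10.0]], [1.0, 1.0]))
```

In the usage line, the link with the empty queue is not scheduled even though it has the better channel, which shows how the QSI enters the water-filling level.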

Example 3 (Lyapunov Optimization Approach for Uplink OFDMA Systems): In the uplink OFDMA
system example, the average sum power minimization in (6) (with all weights equal to 1 for
simplicity of illustration) under the network stability constraint and the subcarrier allocation constraints in (15),
based on Theorem 2, is given by

    $\max\ \sum_{l=1}^{L}\Big(2Q_l(t)\sum_{m=1}^{N_F}\mu_{l,m}(t)-V\sum_{m=1}^{N_F}p_{l,m}\Big),\quad\forall t$    (36)

    s.t. (15) is satisfied.    (37)

Similar to the previous example, we can derive the optimal subcarrier and power allocation as
follows:

    $s_{l,m}=1$ if $X_{l,m}=\max_j\{X_{j,m}\}>0$, and $s_{l,m}=0$ otherwise,    (38)

    $p_{l,m}=s_{l,m}\Big(\frac{2Q_l(t)}{V}-\frac{1}{|H_{l,m}|^2}\Big)^+,$    (39)

where

    $X_{l,m}=2Q_l(t)\log_2\Big(1+|H_{l,m}|^2\Big(\frac{2Q_l(t)}{V}-\frac{1}{|H_{l,m}|^2}\Big)^+\Big)-V\Big(\frac{2Q_l(t)}{V}-\frac{1}{|H_{l,m}|^2}\Big)^+.$

Note that the parameter V is used to adjust the average power-delay tradeoff. ■
Remark 4: The Lyapunov stability drift approach provides a simple alternative to deal with
delay-aware control problems. The derived cross-layer control policies are adaptive to both the
CSI and the QSI. The derived policies are also throughput-optimal (in the stability sense). However,
as we shall illustrate, throughput optimality (stability) is only a weak form of delay performance,
and the derived policies may not have good delay performance, especially in the small delay
regime. There are many recent studies focusing on delay reduction for the traditional DBP
algorithm in multi-hop networks, and we shall elaborate on this further in Section VI.

V. Markov Decision Process and Stochastic Learning Approach

In wireless networks, the system state can be characterized by the aggregation of the CSI and
the QSI. In fact, under Assumptions 1 and 2, the system state dynamics evolve as a controlled
Markov chain and the delay-optimal resource control can be modeled as an infinite horizon
average cost MDP [32]. MDP is a systematic approach for delay-optimal control problems,
which in general can give optimal solutions for any operating regime. However, the main issue
associated with the MDP approach is the curse of dimensionality. For instance, the cardinality
of the system state space is exponential w.r.t. the number of queues in the wireless network,
and hence solving the MDP is quite complicated in general. In addition, the optimal control
actions are adaptive to the global system QSI and CSI, but in some cases these CSI and QSI
observations are obtained locally at each node. Hence, a brute-force centralized solution will
lead to enormous complexity as well as a heavy signaling load to deliver the global CSI and QSI
to the controller. In this section, we briefly summarize the key theories of MDP and stochastic
approximation (SA) and illustrate how we can utilize the techniques of approximate MDP
and stochastic learning to overcome the complexity as well as the distributed implementation
requirement in delay-aware resource control.

A. Why Delay-Optimal Control is an MDP? 

In general, an MDP can be characterized by four elements, namely the state space, the action
space, the state transition probability and the system cost, which are defined as follows:

• $\mathcal{X}=\{\chi^1,\chi^2,\dots\}$: the finite state space with $|\mathcal{X}|$ states;

• $\mathcal{A}=\{a^1,a^2,\dots\}$: the action space;

• $\Pr[\chi'|\chi,a]$: the transition probability from state $\chi$ to state $\chi'$ under action a; and

• $g(\chi,a)$: the system cost in state $\chi$ under action a.

Therefore, an MDP is a 4-tuple $(\mathcal{X},\mathcal{A},\Pr[\cdot|\cdot,\cdot],g(\cdot,\cdot))$. A stationary and deterministic control
policy $\Omega:\mathcal{X}\to\mathcal{A}$ is a mapping from the state space $\mathcal{X}$ to the action space $\mathcal{A}$, which determines
the specific action taken when the system is in state $\chi$. Given policy $\Omega$, the corresponding random
process of the system state and the per-stage cost $(\chi(t),g(t))$ evolves as a Markov chain with
the probability measure induced by the transition kernel $\Pr[\chi'|\chi,\Omega(\chi)]$. The goal of the infinite
horizon average cost problem is to find an optimal policy $\Omega^*$ such that the long term average
cost is minimized among all feasible policies, i.e.,

    $\min_{\Omega}\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}^{\Omega}\big[g(\chi(t),\Omega(\chi(t)))\big],$

where $\mathbb{E}^{\Omega}$ denotes the expectation operator taken w.r.t. the probability measure induced by the
control policy $\Omega$. If the set of feasible policies consists of unichain policies, then the optimization
problem can be written as

    $\min_{\Omega}\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}^{\Omega}\big[g(\chi(t),\Omega(\chi(t)))\big]=\min_{\Omega}\ \mathbb{E}^{\pi(\Omega)}\big[g(\chi,\Omega(\chi))\big],$

where $\pi(\Omega)$ is the unique steady state distribution given policy $\Omega$.

Assume the buffer size is finite and denoted as $N_Q$. The system queue dynamics $\mathbf{Q}(t)$ evolves
according to (1) with projection onto $[0,N_Q]$, and the arrival, departure and CSI processes are
Markovian under Assumptions 1 and 2. Hence, the system state is a finite state controlled
Markov chain with the following correspondence:

• The system state space in the delay-optimal control problem is defined as the aggregation
of the system CSI and the system QSI, thus $\mathcal{X}=\mathcal{H}\times\mathcal{Q}$. $\mathcal{Q}$ and $\mathcal{X}$ are both finite.

• The action space is the space of all control actions, including the resource allocation actions
(e.g., power allocation actions and subcarrier allocation actions) and the routing actions.

• The transition kernel is given by

    $\Pr[\chi(t+1)|\chi(t),\Omega(\chi(t))]=\Pr[\mathbf{Q}(t+1)|\chi(t),\Omega(\chi(t))]\,\Pr[\mathbf{H}(t+1)|\mathbf{H}(t)].$

• By the standard Lagrangian approach, the optimization problem in (5) can be transformed
as follows:

    $\min_{\Omega}\ L^{\Omega}=\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}^{\Omega}\Big[\sum_{n\in\mathcal{N}}\sum_{c\in\mathcal{C}}\Big(\beta_n^{(c)}Q_n^{(c)}(t)+\eta_n^{(c)}\mathbf{1}\big[Q_n^{(c)}(t)=N_Q\big]\Big)+\sum_{n\in\mathcal{N}}\gamma_nP_n(t)\Big],$

where $\gamma_n$ and $\eta_n^{(c)}$ are Lagrange multipliers corresponding to the average power and average
drop rate constraints. Therefore, the per-stage system cost function given a system state $\chi$
is defined as

    $g(\chi,\Omega(\chi))=\sum_{n\in\mathcal{N}}\sum_{c\in\mathcal{C}}\Big(\beta_n^{(c)}Q_n^{(c)}+\eta_n^{(c)}\mathbf{1}\big[Q_n^{(c)}=N_Q\big]\Big)+\sum_{n\in\mathcal{N}}\gamma_nP_n.$    (40)

As a result, there is a one-to-one correspondence between the delay-optimal control problem and
the MDP. The average delay minimization problem under the average power constraint in (5) can be
modeled as an infinite horizon MDP that minimizes the average cost (delay) per stage as follows.
Problem 1 (Delay-Optimal MDP Formulation):

    $\min_{\Omega}L^{\Omega}=\min_{\Omega}\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}^{\Omega}\big[g(\chi(t),\Omega(\chi(t)))\big].$    (41)

■

Note that we restrict our policy space to the unichain policies, where the induced Markov chains
under all feasible unichain policies are ergodic and share the same state space $\mathcal{X}$. In addition,
we assume the induced Markov chains are irreducible and hence, the chains are ergodic with
steady state distribution $\pi(\Omega)$. In this case, the limit of the infinite horizon average cost under policy
$\Omega$ (i.e., $L^{\Omega}$) exists with probability 1 (w.p.1) and is independent of the initial state.



B. Optimal Solution of the Delay-Optimal MDP 

Under the unichain policy space assumption, the delay-optimal control policy of the above
MDP is given by the solution of the Bellman equation [32]. This is summarized in the following
lemma.

Lemma 2 (Bellman Equation): If a scalar $\theta$ and a vector $\mathbf{V}=[V(\chi^1),V(\chi^2),\dots]$ satisfy the
Bellman equation for the delay-optimal MDP in Problem 1,

    $\theta+V(\chi^i)=\min_{\Omega(\chi^i)}\Big\{g(\chi^i,\Omega(\chi^i))+\sum_{\chi^j}\Pr[\chi^j|\chi^i,\Omega(\chi^i)]\,V(\chi^j)\Big\},\quad\forall\chi^i\in\mathcal{X},$    (42)

then $\theta$ is the optimal average cost per stage:

    $\theta=\min_{\Omega}L^{\Omega}=L^*.$    (43)

Furthermore, if $\Omega^*$ attains the minimum in (42) for any $\chi^i\in\mathcal{X}$, it is the optimal control policy.

The Bellman equation (42) can be solved numerically by offline relative value iteration [32]
under certain conditions. While the general solution of the MDP in (41) can be expressed as
a Bellman equation in (42), this is still quite far from a desired solution. There are
two major issues, namely the complexity issue and the signaling overhead issue. Although the
relative value iteration approach [32] can give the optimal solution to the MDP in (41), the solution
is usually too complicated to compute due to the curse of dimensionality. For example, consider
a wireless network with N queues; the total number of system QSI states is $(N_Q+1)^N$ ($N_Q$
is the buffer size of each queue), which grows exponentially with the number of queues. Thus,
it is essentially impossible to compute the potential function at every possible state even for
wireless networks with a small number of queues. Another technical challenge is the distributed
implementation requirement of the control algorithm. For instance, even if we could obtain the
potential function $\mathbf{V}$, the derived control would be centralized, requiring knowledge of the global
system CSI and QSI at each time slot. This is highly undesirable due to the huge signaling
overhead. From an implementation perspective, it is desirable to obtain distributed solutions
where each node computes the control action based on the local CSI and QSI only.
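
For a very small MDP, the relative value iteration mentioned above fits in a few lines. The sketch below is a generic textbook-style version, not the specific procedure of [32]: the function name, the data layout `P[a][x][y]` and `g[x][a]`, the choice of reference state and the stopping rule are assumptions made for illustration, and convergence is only expected under suitable unichain and aperiodicity conditions.

```python
def relative_value_iteration(P, g, ref=0, tol=1e-9, max_iter=10000):
    """Solve theta + V(x) = min_a { g[x][a] + sum_y P[a][x][y] * V(y) }.

    P[a][x][y] -- transition probability from state x to y under action a
    g[x][a]    -- per-stage cost of action a in state x
    ref        -- reference state whose potential is pinned to zero
    Returns (theta, V, policy) with V[ref] == 0.
    """
    n_states, n_actions = len(g), len(g[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        TV, policy = [], []
        for x in range(n_states):
            q = [g[x][a] + sum(P[a][x][y] * V[y] for y in range(n_states))
                 for a in range(n_actions)]
            best = min(range(n_actions), key=q.__getitem__)
            TV.append(q[best])
            policy.append(best)
        theta = TV[ref]                    # average-cost estimate
        V_new = [v - theta for v in TV]    # keep the potentials relative to ref
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return theta, V_new, policy
        V = V_new
    raise RuntimeError("relative value iteration did not converge")


if __name__ == "__main__":
    # Toy 2-state example: action 0 stays, action 1 switches state (extra cost 0.5).
    P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
    g = [[0.0, 0.5], [1.0, 1.5]]
    print(relative_value_iteration(P, g))
```

The inner loop visits every state at every iteration, so the cost per sweep scales with the size of the state space, which for N queues is $(N_Q+1)^N$ QSI states; this is the complexity issue described above.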

In the following sections, we propose two novel approaches to address the above complexity
and overhead issues using approximate MDP and stochastic learning. We first briefly summarize
some major preliminary results on stochastic approximation [47], [48]. The stochastic approximation
algorithm considered in this paper can be characterized by the following d-dimensional
recursion:

    $\mathbf{X}_{n+1}=\mathbf{X}_n+\epsilon_n\big[\mathbf{h}(\mathbf{X}_n)+\mathbf{Z}_n\big],$    (44)

where $\mathbf{X}_n=[X_n(1),X_n(2),\dots,X_n(d)]^T$ is a d-dimensional vector, and $\{\epsilon_n\}$ is a sequence
of positive step sizes. If the following conditions are satisfied, we have Theorem 3 on the
convergence property ([48], Theorem 2).

• The map $\mathbf{h}:\mathbb{R}^d\to\mathbb{R}^d$ is Lipschitz: $\|\mathbf{h}(\mathbf{x})-\mathbf{h}(\mathbf{y})\|\le L\|\mathbf{x}-\mathbf{y}\|$ for some $0<L<\infty$.

• $\{\mathbf{Z}_n\}$ is a martingale difference sequence w.r.t. the increasing family of $\sigma$-fields

    $\mathcal{F}_n=\sigma(\mathbf{X}_m,\mathbf{Z}_m,\ m\le n).$

Furthermore, $\{\mathbf{Z}_n\}$ are square-integrable with

    $\mathbb{E}\big[\|\mathbf{Z}_{n+1}\|^2\mid\mathcal{F}_n\big]\le C\big(1+\|\mathbf{X}_n\|^2\big)$ a.s., $n\ge 0$,

for some constant $C>0$.

• The iterates in (44) remain bounded almost surely, i.e., $\sup_n\|\mathbf{X}_n\|<\infty$ a.s.

Theorem 3: The sequence $\mathbf{X}_n$ generated by (44) converges almost surely to a (possibly sample
path dependent) compact connected internally chain transitive invariant set of the following
ordinary differential equation (ODE):

    $\dot{\mathbf{x}}(t)=\mathbf{h}(\mathbf{x}(t)).$
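
A scalar instance of the recursion (44) shows the behaviour that Theorem 3 describes. The sketch below is an illustration under assumptions chosen here, not part of the source: the drift $h(x)=c-x$, the step sizes $\epsilon_n=1/(n+1)$, the Gaussian noise and the function name are all hypothetical choices.

```python
import random


def stochastic_approximation(h, x0, steps, noise_std=1.0, seed=0):
    """Run x_{n+1} = x_n + eps_n * (h(x_n) + z_n) for a scalar iterate.

    Assumptions: eps_n = 1 / (n + 1) and z_n i.i.d. zero-mean Gaussian,
    which is a martingale difference sequence with bounded variance.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(steps):
        eps = 1.0 / (n + 1)
        z = rng.gauss(0.0, noise_std)
        x = x + eps * (h(x) + z)
    return x


if __name__ == "__main__":
    # The ODE dx/dt = 3 - x has the single equilibrium x = 3.
    print(stochastic_approximation(lambda x: 3.0 - x, x0=0.0, steps=20000))
```

With this particular drift and step size the iterate equals the target plus the running average of the noise samples, so it approaches the equilibrium of the ODE as the number of steps grows.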
C. Approach 1: Approximating Potential Functions



In the existing literature on approximate MDP, Bellman and Dreyfus [49] first proposed to
use polynomials as compact representations for approximating potential functions. In [50], [51],
the authors discuss different approaches for reducing the size of the state space, which lead to
compact representations of potential functions. On the other hand, in [52], the authors develop
several techniques for approximating potential functions using linear combinations of fixed
sets of basis functions. However, these approaches are centralized and focus only on
reducing computational complexity. In this section, we propose a novel feature-based method
that addresses both the complexity issue and the distributed implementation requirement.



Similar to Sections III and IV, we consider the uplink OFDMA system example. For notational
simplicity, we use the link index $l\in\{1,2,\dots,L\}$ to denote the node index as well as the
commodity index. Specifically, we have:

• $\chi_l=(\mathbf{H}_l,Q_l)$ denotes the local state of the l-th link. Thus, the global state $\chi\in\mathcal{X}$ is the
aggregation of the local system states of all links, $\chi=\{\chi_l\mid l\in\mathcal{L}\}$.

• $\mathcal{X}_l$ denotes the state space of the local states of the l-th link. Moreover, the elements in $\mathcal{X}_l$
are enumerated as $\mathcal{X}_l=\{\chi_l^r\mid r=1,2,\dots,|\mathcal{X}_l|\}$, where r denotes the dummy index enumerating
all the local states.

• The local per-stage cost of the l-th link $g_l$ is given by

    $g_l(\chi_l,\Omega_l(\chi))=\beta_lQ_l+\gamma_l\sum_{m=1}^{N_F}p_{l,m}+\eta_l\mathbf{1}[Q_l=N_Q].$

Thus, the overall per-stage cost is given by $g(\chi,\Omega(\chi))=\sum_{l\in\mathcal{L}}g_l(\chi_l,\Omega_l(\chi))$.
We consider the linear approximation architecture of the potential function given below:

    $V(\chi)\approx\sum_{l=1}^{L}\sum_{r=1}^{|\mathcal{X}_l|}V_l(\chi_l^r)\,\mathbf{1}[\chi_l=\chi_l^r],$

with the vector form given by

    $\mathbf{V}\approx\mathbf{M}\widetilde{\mathbf{V}},$    (45)

where $\{V_l(\chi_l)\}$ ($\forall l\in\mathcal{L}$) are the per-link potential functions, which are defined to be the solution
of the Bellman equation (42) on some pre-determined system states. We refer
to this pre-determined subset of system states as the representative states. Without loss of
generality, we define the set of representative states as $\mathcal{X}_R=\{\chi(l,r)\mid\forall l\in\mathcal{L},\ r=2,\dots,|\mathcal{X}_l|\}$,
where $\chi(l,r)$ denotes the joint system state with $\chi_l(l,r)=\chi_l^r$ ($2\le r\le|\mathcal{X}_l|$) and $\chi_{l'}(l,r)=\chi_{l'}^1$
($\forall l'\ne l$). $\mathbf{V}=[V(\chi^1),\dots,V(\chi^{|\mathcal{X}|})]^T$ is the vector form of the original potential function
(referred to as the global potential function in the rest of this paper). The parameter vector $\widetilde{\mathbf{V}}$ and
the mapping matrix $\mathbf{M}$ are given below:

    $\widetilde{\mathbf{V}}=\big[V_1(\chi_1^1),\dots,V_1(\chi_1^{|\mathcal{X}_1|}),\ V_2(\chi_2^1),\dots,V_2(\chi_2^{|\mathcal{X}_2|}),\ \dots,\ V_L(\chi_L^1),\dots,V_L(\chi_L^{|\mathcal{X}_L|})\big]^T,$

    $[\mathbf{M}]_{i,(l,r)}=\mathbf{1}\big[\chi_l^{(i)}=\chi_l^r\big],\quad i=1,\dots,|\mathcal{X}|,\ l=1,\dots,L,\ r=1,\dots,|\mathcal{X}_l|,$

where $\chi_l^{(i)}$ denotes the local state of the l-th link in the i-th global state $\chi^i$, and
where we let $V_1(\chi_1^1)=V_2(\chi_2^1)=\dots=V_L(\chi_L^1)=0$ and $\chi_l^1=(\mathbf{H}_l^1,Q_l=0)$ ($\forall l\in\mathcal{L}$). Moreover,
we define the inverse mapping matrix $\mathbf{M}^{-1}$ elementwise as

    $[\mathbf{M}^{-1}]_{(l,r),i}=\mathbf{1}\big[\chi^i=\chi(l,r)\big],\quad l=1,\dots,L,\ r=1,\dots,|\mathcal{X}_l|,\ i=1,\dots,|\mathcal{X}|.$

Thus, we have

    $\widetilde{\mathbf{V}}=\mathbf{M}^{-1}\mathbf{V}.$    (46)
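
The benefit of the per-link (additive) architecture is the reduction in the number of parameters. The following sketch, with hypothetical helper names and made-up numbers, evaluates the approximation $V(\chi)\approx\sum_lV_l(\chi_l)$ from per-link lookup tables and compares the size of a full table against the per-link tables; it assumes each local state is a hashable key.

```python
def global_potential(per_link_potentials, global_state):
    """Approximate V(chi) by the sum of per-link potentials V_l(chi_l).

    per_link_potentials[l] -- dict mapping a local state of link l to a float
    global_state           -- tuple of local states, one entry per link
    """
    return sum(table[local]
               for table, local in zip(per_link_potentials, global_state))


def parameter_counts(local_state_counts):
    """Return (entries in a full global table, entries in the per-link tables)."""
    full = 1
    for count in local_state_counts:
        full *= count
    return full, sum(local_state_counts)


if __name__ == "__main__":
    tables = [{0: 0.0, 1: 1.5}, {0: 0.0, 1: 2.0}]  # illustrative values only
    print(global_potential(tables, (1, 1)))
    print(parameter_counts([10] * 5))
```

For five links with ten local states each, the full table has 100000 entries while the per-link tables have 50 in total, which is why the per-link potentials can be stored and updated locally at each link.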



One challenge in utilizing the above approximate MDP is how to determine the per-link
potential functions $\widetilde{\mathbf{V}}$. Instead of solving the Bellman equation on the representative states, we
estimate $\widetilde{\mathbf{V}}$ using stochastic approximation techniques. Specifically, the distributed online
iterative algorithm is given by:

Algorithm 1 (Distributed Online Algorithm for Estimating the Per-Link Potential Functions):

Step 1 (Initialization): Start with a set of initial per-link potential vectors $\widetilde{\mathbf{V}}_0$ with $V_{l,0}(\chi_l^1)=0$
($\forall l\in\mathcal{L}$).

Step 2 (Calculate Control Actions): Based on the real-time observation of the system state
$\chi(t)$ at slot t, calculate control actions according to

    $\Omega^*(\chi(t))=\arg\min_{\Omega(\chi(t))}\Big\{\sum_{l\in\mathcal{L}}g_l\big(\chi_l(t),\Omega_l(\chi(t))\big)+\sum_{\chi^j}\Pr\big[\chi^j|\chi(t),\Omega(\chi(t))\big]\,V_t(\chi^j)\Big\},$    (47)

where $\chi(t)=\{\chi_1(t),\dots,\chi_L(t)\}$ and $V_t(\chi^j)=\sum_{l=1}^{L}V_{l,t}(\chi_l^j)$.

Step 3 (Update Per-Link Potential Functions): After the control policy has been determined,
update all the per-link potential functions $\{V_l(\chi_l^r)\}$ ($\forall l\in\mathcal{L}$) based
on the real-time local observations $\chi_l(t)=(\mathbf{H}_l(t),Q_l(t))$, as follows:

Vi,+^ixJ) = ViAxJ) + ecii,rJ{9^^^^ 

I 

- (^gi{xln*,ix'))+Y,Eu'^[vUii[,Qiit + imi^^^^ 



(48) 



where $c(l,r,t)=\sum_{t'=0}^{t}\mathbf{1}[\chi(t')=\chi(l,r)]$ is the number of updates of the representative state
$\chi(l,r)$ up to slot t, $\chi^I=(\chi_1^1,\dots,\chi_L^1)$ denotes the reference state and $\bar{t}=\sup\{t\mid\chi(t)=\chi^I\}$.

Step 4 (Termination): If $\|\widetilde{\mathbf{V}}_t-\widetilde{\mathbf{V}}_{t-1}\|<\delta_v$, stop; otherwise, set t := t + 1 and go to Step 2.

Using the theory of stochastic approximation on the update equation in Step 3, the convergence
of the above online algorithm is given below:
Lemma 3 (Convergence of Algorithm 1): Denote

    $\mathbf{A}_{t-1}=(1-\epsilon_{t-1})\mathbf{I}+\mathbf{M}^{-1}\mathbf{P}(\Omega_t)\mathbf{M}\,\epsilon_{t-1}$ and $\mathbf{B}_{t-1}=(1-\epsilon_{t-1})\mathbf{I}+\mathbf{M}^{-1}\mathbf{P}(\Omega_{t-1})\mathbf{M}\,\epsilon_{t-1},$    (49)

where $\Omega_t$ is the unichain control policy at slot t, $\mathbf{P}(\Omega_t)$ is the transition matrix under the unichain
system control policy $\Omega_t$, and $\mathbf{I}$ is the identity matrix. If for the entire sequence of control policies
{ilt}, there exists an 5t > and some positive integer /3 such that 

[A/3_i ■ ■ ■ Ai](j^7) > [B/3_i ■ ■ ■Bi](j^/) > (5t \fi, 

where denotes the element in the i-th row and the J-th column and 6t = C'(ef), then the 

following statements are true: 

• The update of the parameter vector will converge almost surely for any given initial parameter
vector $\widetilde{\mathbf{V}}_0$, i.e., $\lim_{t\to+\infty}\widetilde{\mathbf{V}}_t=\widetilde{\mathbf{V}}_\infty$ a.s.

• The steady state parameter vector $\widetilde{\mathbf{V}}_\infty$ satisfies $\theta\mathbf{e}+\widetilde{\mathbf{V}}_\infty=\mathbf{M}^{-1}\mathbf{T}(\mathbf{M}\widetilde{\mathbf{V}}_\infty)$, where $\theta$ is a
constant.

■

Footnote: From (48), we can observe that $V_{l,t}(\chi_l^1)=V_{l,0}(\chi_l^1)=0$ ($\forall t\ge 0$).

Proof: Please refer to Appendix A. ■
Remark 5 (Interpretation of the Conditions in Lemma 3): Note that $\mathbf{A}_t$ and $\mathbf{B}_t$ are related to
an equivalent transition matrix of the underlying Markov chain. Eqn. (49) simply means that the
reference state $\chi^I$ is accessible from all the system states after some finite number of transition
steps. This is a very mild condition and is satisfied in most of the cases of interest. ■
Example 4 (Approximating Potential Functions for Uplink OFDMA Systems): In the example 
of the uplink OFDMA system example, we consider packet flows and assume Poisson packet 
arrival with average arrival rate A/ (packet/s) and exponential packet size distribution with mean 
packet size A^^ (bit/packet) for the /-th MS. Given a stationary policy, define the conditional mean 
departure rate of packets of link / (conditioned on the system state x) Jli{x) = l^iix)/Ni = 
J2m=i ^i,m{x)/ (packet/s). Moreover, we assume that the scheduling slot duration (or frame 
duration) r (s/slot) is substantially smaller than the average packet inter-arrival time as well as 
the average packet service time (A/r ^ 1 and ^ There is a packet departure from 

the /-th queue at the (t + l)-th slot if the remaining service time of a packet is less than the 
current slot duration r. By the memoryless property of the exponential distribution, the remaining 
packet length (also denoted as Ni(^t)) at any slot t is also exponentially distributed. Thus, the 
conditional probability of a packet departure event at the t-th slot is given by 



Pr 



^'^^KT\xi{t),niixit)) 



Pr 



N, 



l^i{t) 

=l-exp(-7i,(x(t))r) ^7I,(x(t))r. (50) 

$^{9}$This assumption is reasonable in practical systems. For instance, in UL WiMAX (with multiple UL users served simultaneously), the minimum resource block that can be allocated to a user in the UL is 8 x 16 symbols - 12 pilot symbols = 116 symbols. Even with 64QAM and rate-1/2 coding, the number of payload bits it can carry is 116 x 3 bits = 348 bits. As a result, when many UL users share the WiMAX AP, there can be cases in which an MPEG4 packet (around 10K bits) from a UL user cannot be delivered in one frame. In addition, the delay requirement of MPEG4 is 500 ms or more, while the frame duration of WiMAX is 5 ms. Hence, it is not necessary to serve one packet within one scheduling slot, which gives the scheduler more flexibility in allocating resources. Therefore, in practical systems, an application-level packet may have a mean packet length spanning many time slots (frames), and this assumption is also adopted in [53]-[56].

Note that under the assumptions $\lambda_l \tau \ll 1$ and $\overline{\mu}_l(x)\tau \ll 1$, the probabilities of simultaneous arrival of two or more packets, of simultaneous departure of two or more packets (from the same queue or different queues), and of simultaneous arrival and departure in a slot are $O((\lambda_l\tau)^2)$, $O((\overline{\mu}_l(x)\tau)^2)$ and $O((\lambda_l\tau)\cdot(\overline{\mu}_l(x)\tau))$, respectively, which are asymptotically negligible. Hence, the queue dynamics of each link become a controlled birth-death process with the transition probability of each link given by

$$\Pr\big[x_l(t+1) = (\mathbf{H}_l(t+1), Q_l(t+1)) \,\big|\, x_l(t) = (\mathbf{H}_l(t), Q_l(t)), \Omega_l(x(t))\big] = \begin{cases} \Pr[\mathbf{H}_l(t+1)|\mathbf{H}_l(t)]\,\lambda_l\tau, & Q_l(t+1) = Q_l(t) + 1 \\ \Pr[\mathbf{H}_l(t+1)|\mathbf{H}_l(t)]\,\overline{\mu}_l(x(t))\tau, & Q_l(t+1) = Q_l(t) - 1 \\ \Pr[\mathbf{H}_l(t+1)|\mathbf{H}_l(t)]\,\big(1 - \overline{\mu}_l(x(t))\tau - \lambda_l\tau\big), & Q_l(t+1) = Q_l(t) \end{cases} \quad (51)$$
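The small-slot approximation behind the birth-death model can be checked numerically. The following is a minimal sketch, assuming hypothetical helper names (`queue_transition_probs`, `exact_departure_prob`) and illustrative parameter values that do not come from the paper; it only covers the queue-length part of the transition, not the CSI transition kernel.

```python
import math

def queue_transition_probs(arrival_rate, departure_rate, tau):
    """Per-slot queue-length transition probabilities of the birth-death model.

    arrival_rate   -- lambda_l in packet/s
    departure_rate -- conditional mean departure rate (packet/s) in the current state
    tau            -- slot duration in seconds (assumed small: rate * tau << 1)
    Returns a dict mapping the queue increment (+1, -1, 0) to its probability.
    """
    p_up = arrival_rate * tau
    p_down = departure_rate * tau
    return {+1: p_up, -1: p_down, 0: 1.0 - p_up - p_down}

def exact_departure_prob(departure_rate, tau):
    """Exact departure probability 1 - exp(-mu * tau) for exponential service."""
    return 1.0 - math.exp(-departure_rate * tau)

# Illustrative numbers only: 5 ms slot, 10 packet/s arrivals, 20 packet/s service.
probs = queue_transition_probs(arrival_rate=10.0, departure_rate=20.0, tau=0.005)
gap = abs(exact_departure_prob(20.0, 0.005) - probs[-1])
```

With these values the linearised departure probability differs from the exact one by less than the square of the (rate x slot) product, which is the order of the terms neglected in the model.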

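With the above assumptions, the optimization problem in (47) can be transformed into

$$\min_{\mathbf{s},\mathbf{p}} \sum_{l=1}^{L} \left( \gamma_l \sum_{m=1}^{N_F} p_{l,m} - \frac{\tau}{\overline{N}_l} \Delta V(Q_l) \left( \sum_{m=1}^{N_F} s_{l,m} \log\big(1 + p_{l,m}|H_{l,m}|^2\big) \right) \right), \quad \forall (\mathbf{H}, \mathbf{Q}) \quad (52)$$

s.t. (15) is satisfied,

where

$$\Delta V(Q_l) = \mathbb{E}_{\mathbf{H}_l}\big[V_l(\mathbf{H}_l, Q_l)\big] - \mathbb{E}_{\mathbf{H}_l}\big[V_l(\mathbf{H}_l, Q_l - 1)\big].$$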

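Using standard optimization techniques, the subcarrier and power allocation is given by

$$p_{l,m}(\mathbf{H},\mathbf{Q}) = s_{l,m}(\mathbf{H},\mathbf{Q}) \left( \frac{\tau \Delta V(Q_l)}{\overline{N}_l \gamma_l} - \frac{1}{|H_{l,m}|^2} \right)^+ \quad (53)$$

$$s_{l,m}(\mathbf{H},\mathbf{Q}) = \begin{cases} 1, & \text{if } X_{l,m} = \max_j \{X_{j,m}\} > 0 \\ 0, & \text{otherwise} \end{cases} \quad (54)$$

where

$$X_{l,m} = \frac{\tau}{\overline{N}_l} \Delta V(Q_l) \log\left(1 + |H_{l,m}|^2 \left( \frac{\tau \Delta V(Q_l)}{\overline{N}_l \gamma_l} - \frac{1}{|H_{l,m}|^2} \right)^+ \right) - \gamma_l \left( \frac{\tau \Delta V(Q_l)}{\overline{N}_l \gamma_l} - \frac{1}{|H_{l,m}|^2} \right)^+.$$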
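To make the structure of this allocation concrete, the sketch below implements a water-filling power rule and a "largest positive metric wins" subcarrier rule of the kind described above. It is only an illustration under stated assumptions: the water level is taken as tau * dV / (N * gamma), the metric is the rate reward minus the power price, natural logarithms are used, and all function names and numbers are hypothetical rather than taken from the paper.

```python
import math

def water_level(delta_v, tau, mean_pkt_bits, gamma):
    """Assumed water level: tau * dV / (mean packet size * power price)."""
    return tau * delta_v / (mean_pkt_bits * gamma)

def power(level, gain):
    """Water-filling power for one subcarrier; gain stands for |H_{l,m}|^2."""
    return max(level - 1.0 / gain, 0.0)

def metric(delta_v, tau, mean_pkt_bits, gamma, gain):
    """Per-subcarrier metric: weighted rate minus priced power at the water-filling point."""
    p = power(water_level(delta_v, tau, mean_pkt_bits, gamma), gain)
    return tau * delta_v / mean_pkt_bits * math.log(1.0 + gain * p) - gamma * p

def allocate(delta_v, gamma, mean_pkt_bits, gains, tau):
    """Return (s, p): s[l][m] in {0, 1} and p[l][m] >= 0 for L users and M subcarriers."""
    num_users, num_sub = len(gains), len(gains[0])
    s = [[0] * num_sub for _ in range(num_users)]
    p = [[0.0] * num_sub for _ in range(num_users)]
    for m in range(num_sub):
        bids = [metric(delta_v[l], tau, mean_pkt_bits[l], gamma[l], gains[l][m])
                for l in range(num_users)]
        winner = max(range(num_users), key=bids.__getitem__)
        if bids[winner] > 0:  # a subcarrier stays idle if no metric is positive
            s[winner][m] = 1
            level = water_level(delta_v[winner], tau, mean_pkt_bits[winner], gamma[winner])
            p[winner][m] = power(level, gains[winner][m])
    return s, p
```

In a distributed setting each user would evaluate `metric` locally and submit it as its bid, so only the per-subcarrier bids need to be exchanged.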

Hence, the control action calculation (Step 2 of Algorithm 1) and per-link potential update (Step 3 of Algorithm 1) are given below:

• Control Action Calculation: Based on the realtime observation of the system state $x(t) = (\mathbf{H}(t), \mathbf{Q}(t))$, perform subcarrier and power allocation according to (53) and (54) at the $t$-th slot. In a distributed implementation, each user maintains its own per-link potential, i.e., the $l$-th user maintains the potentials $\{V_l(x_l^i)\}$ of its local states. According to (54), the subcarrier allocation can be determined distributively by an auction mechanism. For example, each user $l$ submits a bid $X_{l,m}$ on each subcarrier $m$, and the user with the largest bid gets the subcarrier. Once the subcarrier allocation is determined, the power allocation for each link can be calculated locally at each user according to (53).
• Per-link Potential Update: Suppose in the t-th time slot the system is in the reference state 
t), i.e., = t), the l-th user will update the per-link potential Vi{xi) according 

to 

Np 

ViMxJ) = ViAxJ) + ecii,r,t) [{i^iQiit) + 7/ + Vi^lQiit) = Nq] 

m=l 

I I 

(55) 

Remark 6 (Implementation Considerations): Note that we choose the reference state as $x^I = \{x_1^I, x_2^I, \ldots, x_L^I\}$ with $x_l^I = (\mathbf{H}_l^I, Q_l^I)$, where $\mathbf{H}_l^I$ can be any fixed local CSI while the local QSI $Q_l^I = 0$ (i.e., the buffer is empty). Hence, each source node of the $l$-th link requires only the local CSI $\mathbf{H}_l$, the local QSI $Q_l$, as well as some potential functions of the other links, $\{\mathbb{E}[V_{l'}(\mathbf{H}_{l'}, Q_{l'}) \mid \mathbf{H}_{l'}^I] : Q_{l'} = 0, 1\}$, in order to compute the update in (55). While the computational complexity and signaling overhead have been substantially reduced compared with the brute-force centralized solution, the computation and the overhead of delivering the terms $\{\mathbb{E}[V_{l'}(\mathbf{H}_{l'}, Q_{l'}) \mid \mathbf{H}_{l'}^I] : Q_{l'} = 0, 1\}$ to all the nodes are still quite heavy. In the next section, we elaborate on a second approximation approach which could further reduce the complexity and overhead. ■



D. Approach 2: Approximating Q-Factors 

In this section, we propose another approach to address the complexity and signaling overhead issues by approximating Q-factors. Different from the approach of approximating the potential function, this approach can establish a totally distributed learning algorithm at each node of the system. From the Bellman equation of the delay-optimal MDP in (42), the Q-factor is defined as

$$Q(x, a) = g(x, a) + \sum_{x'} \Pr[x'|x, a] V(x') - \theta, \quad \forall x, \quad (56)$$

where $a$ is an arbitrary action in the action space $\mathcal{A}$. Hence, we have

$$V(x) = \min_{a} Q(x, a), \quad \forall x,$$

and $Q(x, a)$ satisfies the following "Q-factor form" of the Bellman equation

$$Q(x, a) = g(x, a) + \sum_{x'} \Pr[x'|x, a] \min_{b} Q(x', b) - \theta, \quad \forall x. \quad (57)$$

Moreover, the optimal control policy is given by

$$\Omega^*(x) = \arg\min_{a \in \mathcal{A}} Q(x, a), \quad \forall x. \quad (58)$$
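For a small, fully known MDP, the Q-factor form of the Bellman equation can be solved offline by a relative value iteration. The sketch below is an illustration only: the function name `relative_q_iteration`, the toy cost and transition arrays, and the choice of pinning one reference state-action pair to zero are assumptions made here, and convergence is only expected for unichain, aperiodic models such as the strictly positive transition matrices used in the example.

```python
import numpy as np

def relative_q_iteration(g, P, ref=(0, 0), iters=2000):
    """Solve Q(x,a) = g(x,a) + sum_x' Pr[x'|x,a] min_b Q(x',b) - theta on a toy MDP.

    g[x, a]     -- per-stage cost
    P[a, x, x'] -- transition probabilities under action a
    ref         -- reference (state, action) pair whose Q-factor is pinned to 0
    Returns (Q, theta), with theta the average cost per stage.
    """
    n_states, n_actions = g.shape
    Q = np.zeros((n_states, n_actions))
    theta = 0.0
    for _ in range(iters):
        V = Q.min(axis=1)                        # V(x) = min_a Q(x, a)
        T = g + np.einsum("axy,y->xa", P, V)     # g(x,a) + E[V(x') | x, a]
        theta = T[ref]                           # offset that keeps Q[ref] at 0
        Q = T - theta
    return Q, theta

# Toy example: 2 states, 2 actions (numbers are arbitrary illustrations).
g = np.array([[1.0, 2.0], [3.0, 0.5]])
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.3, 0.7]]])
Q, theta = relative_q_iteration(g, P)
policy = Q.argmin(axis=1)                        # greedy policy: argmin over actions
```

The table `Q` has one entry per state-action pair, which is what makes the exact approach impractical for large systems and motivates the per-link approximation discussed next.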



As an illustration, we consider the uplink OFDMA system example. Similar to Section V-C, we approximate the Q-factor in (56) by a linear approximation given by:

$$Q(x, a) \approx \sum_{l=1}^{L} q_l(x_l, a_l), \quad \forall x, \quad (59)$$

where $a_l$ denotes the local actions (such as the local subcarrier allocation, local power allocation, precoder design, etc.) of the $l$-th link (thus $a = \{a_l\}$), and $q_l(x_l, a_l)$ is referred to as the per-link Q-factor for the $l$-th link with local system state $x_l$ and action $a_l$. Moreover, the per-link Q-factor is defined as the solution of the following per-link fixed-point equation:

$$q_l(x_l, a_l) = g_l(x_l, a_l) + \sum_{x_l'} \Pr[x_l'|x_l, a_l] W_l(x_l') - \theta_l, \quad (60)$$

where

$$W_l(x_l) = \mathbb{E}_{H^*_{-l}}\left[ \min_{a_l \backslash s_l} q_l\big(x_l, \{s_{l,m} = \mathbf{1}(|H_{l,m}| \ge H^*_{-l})\}, a_l\big) \right],$$



9i{Xh ai) = ^iQi + Pi,m + Vi'^iQi = Nq], 

m=l 

$s_l = \{s_{l,m} \mid \forall m \in \{1, 2, \ldots, N_F\}\}$, $a_l \backslash s_l$ denotes all the control actions except the subcarrier selection $s_l$, and $H^*_{-l}$ denotes the largest order statistic of the $L - 1$ i.i.d. random variables, each of which has the same distribution as the channel fading. Therefore, an online Q-learning algorithm for estimating per-link Q-factors is given below:
Algorithm 2 (Online Algorithm for Estimating Per-Link Q-factors):

• Step 1 (Initialization): Start with an initial per-link Q-factor $\{q_{l,0}(x_l, a_l)\}$ with $q_{l,0}(x_l^I, a_l^I) = 0$ ($\forall l \in \mathcal{L}$).

• Step 2 (Calculate Control Actions): Based on the realtime observation of the system state $x(t)$ at slot $t$, calculate control actions according to:

$$a_t = \{a_{1,t}, a_{2,t}, \ldots, a_{L,t}\} = \arg\min_{a} Q_t(x(t), a) = \arg\min_{a} \sum_{l=1}^{L} q_{l,t}(x_l(t), a_l).$$

• Step 3 (Update Per-Link Q-Factors): After the control action has been determined, update all the per-link Q-factors $\{q_l(x_l, a_l)\}$ ($\forall l \in \mathcal{L}$) based on the real-time observations of the local per-link system state $x_l(t)$ ($\forall l \in \mathcal{L}$), where $x_l(t) = (\mathbf{H}_l(t), Q_l(t))$, as follows:

$$q_{l,t+1}(x_l, a_l) = q_{l,t}(x_l, a_l) + \epsilon_{c_l(x_l, a_l, t)} \Big[ \big(g_l(x_l, a_l) + W_{l,t}(x_l(t+1))\big) - \big(g_l(x_l^I, a_l^I) + W_{l,t}(x_l(\bar{t}_l + 1))\big) - q_{l,t}(x_l, a_l) \Big] \mathbf{1}\big[(x_l, a_l) = (x_l(t), a_l(t))\big], \quad (61)$$

where $c_l(x_l, a_l, t) = \sum_{t'=0}^{t} \mathbf{1}[(x_l, a_l) = (x_l(t'), a_l(t'))]$ is the number of updates of the state-action pair $(x_l, a_l)$ up to slot $t$, $(x_l^I, a_l^I)$ denotes the reference state-action pair$^{10}$ of the $l$-th link, and $\bar{t}_l = \sup\{t \mid (x_l(t), a_l(t)) = (x_l^I, a_l^I)\}$.

• Step 4 (Termination): If $\sum_{l \in \mathcal{L}} \|q_{l,t} - q_{l,t-1}\| < \delta_q$, stop; otherwise, set $t := t + 1$ and go to Step 2.
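The local update in Step 3 can be sketched for a single link as follows. This is a simplified illustration rather than the paper's implementation: the class name `PerLinkQ` is hypothetical, the step size is assumed to be 1/c (one over the visit count), the term W is approximated by a plain minimum of the current Q-factors over the local actions (the expectation over the order statistic is omitted), and the reference baseline is taken as zero until the reference pair has been visited once.

```python
from collections import defaultdict

class PerLinkQ:
    """Tabular per-link Q-factor learner with a reference state-action pair."""

    def __init__(self, ref_pair, actions):
        self.q = defaultdict(float)     # q_l(x_l, a_l), initialised to 0
        self.count = defaultdict(int)   # number of updates of each pair
        self.ref_pair = ref_pair        # (reference state, reference action)
        self.actions = list(actions)
        self.ref_cost = None            # per-stage cost observed at the reference pair
        self.ref_next_state = None      # state observed right after the last reference visit

    def w(self, state):
        """Simplified W: minimum of the current Q-factors over the local actions."""
        return min(self.q[(state, a)] for a in self.actions)

    def update(self, state, action, cost, next_state):
        pair = (state, action)
        if pair == self.ref_pair:       # remember the latest visit to the reference pair
            self.ref_cost, self.ref_next_state = cost, next_state
        self.count[pair] += 1
        eps = 1.0 / self.count[pair]    # assumed diminishing step size
        target = cost + self.w(next_state)
        baseline = 0.0
        if self.ref_next_state is not None:
            baseline = self.ref_cost + self.w(self.ref_next_state)
        self.q[pair] += eps * (target - baseline - self.q[pair])
        return self.q[pair]
```

Only locally observed quantities (local state, local action, local cost) enter `update`, which is the property that allows each node to run the learner without signaling. Under this update the Q-factor of the reference pair stays at its initial value of zero.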



Remark 7 (Implementation Considerations): Using the linear approximation in (59), the dimension of the Q-factor (and hence the computational complexity) is significantly reduced. Furthermore, the online update procedure in Step 3 can be implemented locally at each node, requiring only knowledge of the local CSI $\mathbf{H}_l$ and the local QSI $Q_l$. ■



Similarly, using the theory of stochastic approximation on the update equation (61) in Step 3, we summarize the convergence of the above online learning algorithm as follows:

Lemma 4 (Convergence of Algorithm 2): Denote $\mathbf{q}_l = \{q_l(x_l, a_l)\}$ and

$$A_{t-1} = (1 - \epsilon_{t-1})\mathbf{I} + \mathbf{F}(\Omega_t)\epsilon_{t-1}, \quad B_{t-1} = (1 - \epsilon_{t-1})\mathbf{I} + \mathbf{F}(\Omega_{t-1})\epsilon_{t-1},$$

where $\Omega_t$ is the unichain system control policy at the $t$-th frame, $\mathbf{F}(\Omega_t)$ is the transition matrix under the unichain system control policy $\Omega_t$, and $\mathbf{I}$ is the identity matrix. If for the entire sequence of control policies $\{\Omega_t\}$, there exist a $\delta_\beta > 0$ and some positive integer $\beta$ such that

$$[A_{\beta-1} \cdots A_1]_{(a,I)} \ge \delta_\beta, \quad [B_{\beta-1} \cdots B_1]_{(a,I)} \ge \delta_\beta \quad \forall a,$$

where $[\cdot]_{(a,I)}$ denotes the element in the $a$-th row and the $I$-th column and $\delta_\beta = O(\epsilon_\beta)$, then the following statements are true:

• The update of the per-link Q-factor will converge almost surely for any given initial per-link Q-factor $\{\mathbf{q}_{l,0}\}$, i.e., $\lim_{t \to +\infty} \mathbf{q}_{l,t} = \mathbf{q}_{l,\infty}$ a.s. ($\forall l \in \mathcal{L}$).

• The steady state per-link Q-factor satisfies:

$$q_{l,\infty}(x_l, a_l) = g_l(x_l, a_l) + \sum_{x_l'} \Pr[x_l'|x_l, a_l] W_{l,\infty}(x_l') - \theta_l,$$

where $\theta_l$ is a constant and

$$W_{l,\infty}(x_l) = \mathbb{E}_{H^*_{-l}}\left[ \min_{a_l \backslash s_l} q_{l,\infty}\big(x_l, \{s_{l,m} = \mathbf{1}(|H_{l,m}| \ge H^*_{-l})\}, a_l\big) \right].$$

■

$^{10}$From (61), we can observe that $q_{l,t}(x_l^I, a_l^I) = q_{l,0}(x_l^I, a_l^I) = 0$ ($\forall t > 0$).

The proof of Lemma 4 follows a similar approach as the proof of Lemma 3. In the following, we elaborate Algorithm 2 using the uplink OFDMA system example.

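Example 5 (Approximating Q-Factors for Uplink OFDMA Systems): Consider the uplink OFDMA system example under the same assumptions as in Example 4. Since the power control can be determined locally given a subcarrier allocation action, the per-link Q-factor is defined as $\{q_l(x_l, s_l) \mid \forall x_l, s_l\}$ ($\forall l \in \mathcal{L}$), satisfying

$$q_l(x_l, s_l) = \min_{p_l} \Big\{ g_l(x_l, a_l) + \sum_{x_l'} \Pr[x_l'|x_l, a_l] W_l(x_l') - \theta_l \Big\} \quad (62)$$

$$= \min_{p_l} \Big\{ g_l(x_l, a_l) + \widetilde{W}_l(\mathbf{H}_l, Q_l) + \lambda_l \tau \Delta\widetilde{W}_l(\mathbf{H}_l, Q_l + 1) - \frac{\tau}{\overline{N}_l} \Delta\widetilde{W}_l(\mathbf{H}_l, Q_l) \Big( \sum_{m=1}^{N_F} s_{l,m} \log\big(1 + p_{l,m}|H_{l,m}|^2\big) \Big) - \theta_l \Big\},$$

where $\widetilde{W}_l(\mathbf{H}_l, Q_l) = \mathbb{E}_{\mathbf{H}_l'}[W_l(\mathbf{H}_l', Q_l) \mid (\mathbf{H}_l, Q_l)]$ and $\Delta\widetilde{W}_l(\mathbf{H}_l, Q_l) = \widetilde{W}_l(\mathbf{H}_l, Q_l) - \widetilde{W}_l(\mathbf{H}_l, Q_l - 1)$. Due to the symmetry of each subcarrier and the birth-death queue dynamics, the per-link Q-factor satisfying (62) can be written as the summation of the per-link per-subcarrier Q-factors

$$q_l(x_l, s_l) = \sum_{m=1}^{N_F} q_l(H_{l,m}, Q_l, s_{l,m}),$$

where the per-link per-subcarrier Q-factor $\{q_l(H, Q, s)\}$ satisfies the following per-link per-subcarrier fixed-point equation: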

<ll{Hl^rn, Ql, Si^m) 

. j Ql , Vim = NQ] , ElUHHi,m,Qi) + Al7i{Hi,^,Qi + l)) 
= mm <i^i— + im,m H 77 h 



F 



~ = {y^^ ^'^l{Hl,m,Ql))si,m^Og{l + Pi^ra\Hl^m?) - — ^ 

JSl l^F 
m=l 

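where $U_l(H_{l,m}, Q_l) = \mathbb{E}\big[q_l(H_{l,m}, Q_l, s_{l,m} = \mathbf{1}[|H_{l,m}| \ge H^*_{-l}]) \mid (H_{l,m}, Q_l)\big]$, $\widetilde{U}_l(H_{l,m}, Q_l) = \mathbb{E}_{H'_{l,m}}\big[U_l(H'_{l,m}, Q_l) \mid (H_{l,m}, Q_l)\big]$ and $\Delta\widetilde{U}_l(H_{l,m}, Q_l) = \widetilde{U}_l(H_{l,m}, Q_l) - \widetilde{U}_l(H_{l,m}, Q_l - 1)$. According to (58), the subcarrier allocation is given by

$$s_{l,m}(\mathbf{H}, \mathbf{Q}) = \begin{cases} 1, & \text{if } q_l(H_{l,m}, Q_l, s_{l,m} = 1) = \min_j q_j(H_{j,m}, Q_j, s_{j,m} = 1) \\ 0, & \text{otherwise} \end{cases} \quad (63)$$

Moreover, given the subcarrier allocation, by standard optimization techniques, the power allocation is given by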

f wXlU^HHi,r.,Qi) 1 \^ 

Pi,Uii, Q = si,m — rn—i^ ■ (64) 

\ ii \Hi,mr J 

Hence, the control action calculation (Step 2 of Algorithm 2) and Q-factor update (Step 3 of Algorithm 2) are given below:

• Control Action Calculation: Based on the realtime observation of the system state $x(t) = (\mathbf{H}(t), \mathbf{Q}(t))$ at the $t$-th slot, we determine the subcarrier allocation distributively by an auction mechanism according to (63): each user $l$ submits one bid $q_{l,t}(H_{l,m}, Q_l, s_{l,m} = 1)$ on each subcarrier. The user with the minimum bid gets the subcarrier. Given the subcarrier allocation, the power allocation can be calculated locally at each user $l$ according to (64).

• Q-Factor Update: After the control action is determined, update all the per-link per-subcarrier Q-factors $\{q_l(H, Q, s)\}$ ($\forall l \in \mathcal{L}$) based on the real-time observations of the local per-link system state $x_l(t) = (\mathbf{H}_l(t), Q_l(t))$ as follows:

qi,t+i{H, Q, s) =qLt{H, Q, s) + ect(Q,H,s,t) + liPi,m{t) + + w^{Qi{t + 1))) 

I\ F 

- mAQiiti + 1) - %tiH, Q, s)]l[U^l,{{H, Q, s) = (Hi^Ut), Qiit), }], 

(65)

where $c_l(H, Q, s, t) = \sum_{t'=0}^{t} \mathbf{1}\big[\cup_m \{(H, Q, s) = (H_{l,m}(t'), Q_l(t'), s_{l,m}(t'))\}\big]$ is the number of updates of the per-subcarrier state-action pair $(H, Q, s)$ up to slot $t$, $(H^I, Q^I, s^I)$ denotes the reference state-action pair of each link, and $\bar{t}_l = \sup\{t' \mid \cup_m \{(H^I, Q^I, s^I) = (H_{l,m}(t'), Q_l(t'), s_{l,m}(t'))\}\}$.

Notice that the above Q-factor update requires only the local information at each user, and it does not lead to any signaling overhead.

■
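The auction used for the subcarrier allocation in this example reduces to a per-subcarrier minimum over the submitted bids. A minimal sketch, assuming a hypothetical helper `min_bid_auction` and bids stored as a nested list indexed by user and subcarrier (ties go to the lowest user index here, a choice made only for this illustration):

```python
def min_bid_auction(bids):
    """Return, for each subcarrier m, the index of the user with the smallest bid.

    bids[l][m] is the per-subcarrier Q-factor that user l submits for subcarrier m,
    evaluated with the subcarrier assumed to be assigned to that user.
    """
    num_users, num_sub = len(bids), len(bids[0])
    return [min(range(num_users), key=lambda l: bids[l][m]) for m in range(num_sub)]

# Two users bidding on two subcarriers (illustrative numbers).
winners = min_bid_auction([[3.0, 1.5],
                           [2.0, 4.0]])
```

Because a Q-factor represents a cost-to-go, the smallest bid wins here, in contrast to the largest-metric rule used with the potential-function approach.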

Remark 8 (Comparison of the Two Approximate MDP Approaches): Both of the approximate MDP approaches in Section V-C and Section V-D can effectively reduce the complexity and signaling overhead in the MDP solutions. However, there are pros and cons in the two approaches:

• In general, approximating potential functions will lead to fewer dimensions (and lower memory requirement) than approximating Q-factors. This is because a Q-factor depends on both the system state and the control action.

• In the distributed implementation of the online learning algorithm, updates of the per-link Q-factor can be done locally without any signaling overhead, whereas updates of the per-link potential function still require some information exchange among the nodes.

• In some cases, computing actions from potential functions may still be complicated compared with computing actions from Q-factors.

Although approximate MDP and stochastic learning can effectively reduce the complexity in the MDP solution, extension to multi-hop networks is far from trivial due to the complex interactions of the queue dynamics and the huge state space involved. More investigations are needed regarding how to approximate potential functions or Q-factors as well as the associated convergence proof in multi-hop networks. ■

VI. Delay-Aware Routing in Multi-hop Wireless Networks

In this section, we focus on delay-aware routing in wireless multi-hop networks using the Lyapunov stability drift approach. Due to the complex coupled queue dynamics in multi-hop wireless networks, extensions of the equivalent rate constraint approach and the approximate MDP approach to multi-hop networks are highly non-trivial. On the other hand, the Lyapunov drift approach can be easily applied in multi-hop networks to derive dynamic control algorithms that are adaptive to both the system CSI and the system QSI. Hence, the Lyapunov drift approach has recently been receiving more and more attention in multi-hop networks. In the following, we first review the traditional DBP routing in wireless multi-hop networks and then focus on various delay reduction techniques in the enhanced DBP routing [26], [57]-[64].

Fig. 4. Illustration of a single packet taking a periodic walk under traditional DBP routing in a network.



A. Traditional DBP Routing 

The traditional DBP routing in wireless multi-hop networks was originally proposed in the seminal paper [42] to maximize the stability region and then extended in [13], [26]. This traditional DBP routing is illustrated below [13], [26]:

Algorithm 3 (Traditional DBP Routing):

• Resource allocation: For each commodity $c \in \mathcal{C}$ and each link $l \in \mathcal{L}$, define the backpressure of link $l$ w.r.t. commodity $c$ as

$$\Delta Q_l^{(c)}(t) = Q_{s(l)}^{(c)}(t) - Q_{d(l)}^{(c)}(t). \quad (66)$$

For each link $l \in \mathcal{L}$, define the optimal commodity of link $l$ as $c_l^*(t) = \arg\max_{c \in \mathcal{C}} \Delta Q_l^{(c)}(t)$ and the optimal backpressure of link $l$ as $\Delta Q_l^*(t) = \max\{\Delta Q_l^{(c_l^*(t))}(t), 0\}$. Find the transmission rate such that

$$\boldsymbol{\mu}^*(t) = \arg\max_{\boldsymbol{\mu}} \sum_{l \in \mathcal{L}} \Delta Q_l^*(t)\mu_l. \quad (67)$$

• Routing: For each link $l$ such that $\Delta Q_l^*(t) > 0$, offer a transmission rate $\mu_l^*(t)$ to the data of commodity $c_l^*(t)$ through link $l$.
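One slot of the backpressure rule can be sketched as below. The sketch assumes, purely for illustration, that the feasible transmission rates are given as a finite list of candidate rate vectors (for example, one per interference-free link activation); the function name `dbp_decision` and the data layout are hypothetical.

```python
def dbp_decision(Q, links, rate_options):
    """One slot of differential-backlog (backpressure) routing.

    Q[n][c]      -- backlog of commodity c at node n
    links        -- list of (source node, destination node) pairs
    rate_options -- finite list of feasible rate vectors, one entry per link
    Returns (weights, mu_star, routing), where routing maps a link index to the
    (commodity, rate) it serves and only contains links with positive backpressure.
    """
    weights, best_commodity = [], []
    for src, dst in links:
        diffs = {c: Q[src][c] - Q[dst][c] for c in Q[src]}   # per-commodity backpressure
        c_star = max(diffs, key=diffs.get)                   # optimal commodity of the link
        best_commodity.append(c_star)
        weights.append(max(diffs[c_star], 0))                # optimal backpressure
    # Max-weight rate selection over the candidate rate vectors.
    mu_star = max(rate_options,
                  key=lambda mu: sum(w * m for w, m in zip(weights, mu)))
    routing = {i: (best_commodity[i], mu_star[i])
               for i in range(len(links)) if weights[i] > 0}
    return weights, mu_star, routing
```

For example, with two links that cannot be active simultaneously, `rate_options` would be `[(1, 0), (0, 1)]` and the link with the larger optimal backpressure is served.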
Fig. 5. Illustration of differential backlogs under traditional DBP routing in a tandem queueing network.



A significant weakness of the above DBP routing algorithm is that it can suffer from very large delays due to the following reasons.

• First, the traditional DBP routing exploits all possible paths between source-destination pairs (i.e., load balancing over the entire network) to maximize the stability region without considering the delay performance. This extensive exploration is essential in order to maintain stability when the network is heavily loaded. However, under light or moderate loads, packets may be sent over unnecessarily long routes, which leads to excessive delays. For example, if a single packet is injected into an empty network, there is no backpressure to suggest an appropriate path. Hence, the packet might take a random walk through the network, or might take a periodic walk that never leads to the destination, as illustrated in Fig. 4. In this case, although the network congestion is quite low (only one packet, i.e., zero average arrival rate), the end-to-end delay can be infinite. Similarly, under light load, the end-to-end delay can be large even though the average queue length is guaranteed to be bounded by the Lyapunov drift theory [13], [26]. Therefore, it is desirable to design a throughput-optimal routing algorithm which exploits longer paths only when it is necessary.

• Second, due to the large queue sizes that must be maintained to provide a gradient (backpressure) for each data flow, the DBP routing can suffer from very large delays, and the queues grow in size with distance from the destination. To obtain some insights on this, let us consider a flow traveling through an $N$-hop tandem queueing network with $N + 1$ nodes, as illustrated in Fig. 5. Let $Q_0$ be the queue length at the destination node and $Q_n$ be the queue length of the $n$-th upstream node from the destination node 0, where $n = 1, 2, \ldots, N$. Set $Q_0 = 0$. Under the traditional DBP routing algorithm, for a link to be scheduled, the differential backlog associated with it should be positive. Thus, $Q_1 - Q_0 = Q_1$ will be some positive number, say $\epsilon$, and $Q_2 - Q_1$ will be even larger than $\epsilon$. For the purpose of obtaining some insights, let us informally assume $Q_2 - Q_1 = \epsilon$, which implies $Q_1 = \epsilon$ and $Q_2 = 2\epsilon$. Similarly, we can obtain $Q_n = n\epsilon$. Thus, the total queue length for the flow under the traditional DBP routing will be $\sum_{n=1}^{N} Q_n = \epsilon(1 + 2 + \cdots + N) = O(N^2)$ [51]. Therefore, it is desirable to design a routing algorithm which can provide a sufficient gradient for each data flow without causing too large a delay for each packet.
• Finally, the traditional DBP routing specifies a single next-hop receiver before transmission, and hence does not exploit the broadcast advantage of multi-hop wireless networks when wireless channels are unreliable (e.g., subject to outage when CSI is not available). Due to multi-receiver diversity in wireless channels, the probability of successful reception by at least one node within a subset of potential receivers is much larger than that of just one receiver. Therefore, it is desirable to design flexible routing that dynamically adjusts routing and scheduling decisions in response to the random outcome of each transmission.
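The quadratic growth of the total backlog in the tandem example of the second item can be reproduced in a few lines. This sketch follows the informal assumption that every hop keeps the same differential backlog; the function name and the numerical values are illustrative only.

```python
def tandem_backlogs(num_hops, eps):
    """Backlogs Q_1..Q_N of a tandem network when each hop keeps a differential backlog eps."""
    return [n * eps for n in range(1, num_hops + 1)]

def total_backlog(num_hops, eps):
    """Total backlog of the flow, equal to eps * N * (N + 1) / 2."""
    return sum(tandem_backlogs(num_hops, eps))

# Doubling the path length roughly quadruples the total backlog.
ratio = total_backlog(200, 1) / total_backlog(100, 1)
```

By Little's law, a total backlog that grows quadratically in the number of hops translates into an average end-to-end delay that grows quadratically as well for a fixed arrival rate.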
Given the drawbacks of the DBP routing discussed above, most recent studies try to improve the delay performance of the DBP routing while maintaining its advantage in throughput optimality. In the following, we discuss three aspects of delay reduction in DBP routing: utilizing the shortest path concept [13], [26], [57], [58], modifying the queueing disciplines [59]-[61], and exploiting receiver diversity over unreliable channels [62]-[64] in wireless multi-hop networks.



B. Delay Reduction in DBP Routing by Shortest Path 

One of the major reasons for the poor end-to-end delay performance of the traditional DBP routing algorithms is the extensive exploration of routes. However, reducing delay by restricting the routing constraint sets to some shorter paths will reduce the stability region. In [13], [26], [57], [58], the authors incorporate the idea of shortest path routing into traditional DBP routing algorithms in different ways while maintaining throughput optimality, as illustrated in the following.

The enhanced DBP routing algorithm proposed in [26] programs a shortest path bias into the backpressure $\Delta Q_l^{(c)}(t)$ defined in (66) so that, in light or moderate loading situations, nodes are inclined to route packets in the direction of their destinations. Therefore, we call this enhanced DBP routing algorithm shortest path bias DBP routing. Specifically, the backpressure of link $l$ w.r.t. commodity $c$ in the enhanced DBP routing algorithm is defined as

$$\Delta Q_l^{(c)}(t) \triangleq \big(Q_{s(l)}^{(c)}(t) + Z_{s(l)}^{(c)}\big) - \big(Q_{d(l)}^{(c)}(t) + Z_{d(l)}^{(c)}\big), \quad (68)$$

where $Z_n^{(c)}$ is the shortest path bias at node $n$ for commodity $c$. $Z_n^{(c)}$ can be chosen to be proportional to the distance (or number of hops) between node $n$ and the destination of commodity $c$ (where $Z_n^{(c)} = 0$ if node $n$ is the destination of commodity $c$). Apart from the shortest path bias, the shortest path bias DBP routing algorithm is the same as the traditional DBP routing algorithm in Algorithm 3. It is shown that the enhanced DBP algorithm with the shortest path bias is still throughput-optimal. In addition, the simulation results establish better delay performance of the shortest path bias DBP routing than the traditional DBP routing [13].
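The effect of a hop-count bias can be illustrated on an empty network, where the unbiased backpressure is zero on every link. The sketch below assumes that the bias equals the hop count to the destination (computed by breadth-first search) times a scale factor of one; the function names and the toy topology are hypothetical.

```python
from collections import deque

def hop_counts(adj, dest):
    """Breadth-first search: number of hops from every reachable node to dest."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def biased_backpressure(Q, Z, src, dst):
    """Backpressure of link (src -> dst) with a shortest path bias Z added to each backlog."""
    return (Q[src] + Z[src]) - (Q[dst] + Z[dst])

# Toy undirected topology: e - a - b - d, with d the destination and all queues empty.
adj = {"a": ["b", "e"], "b": ["a", "d"], "d": ["b"], "e": ["a"]}
Z = hop_counts(adj, "d")
Q = {n: 0 for n in adj}
toward = biased_backpressure(Q, Z, "a", "b")   # link pointing toward the destination
away = biased_backpressure(Q, Z, "a", "e")     # link pointing away from the destination
```

Even with empty queues, the link toward the destination has positive backpressure while the link away from it has negative backpressure, so a single packet is no longer left to wander.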

To reduce the end-to-end delay, [57] introduces a cost function, namely the total link rate in the network. Given a set of packet arrival rates that lie within the stability region, the total link rate can be used to measure the efficiency of the system resource utilization. Thus, the min-resource routing problem is formulated to find the routes that minimize the total link rate:

ie£ cec 

le{l:d{l)=n} l&{l:s{l)=n} 

Due to the nature of the cost function, shorter paths are preferred over longer paths. For example, in a network with all links of equal capacity, we prefer to have as few hops as possible to have good delay performance. The associated routing algorithm is called the min-resource DBP routing algorithm. Instead of (66), it uses (66) minus a parameter $V$ as the backpressure of link $l$ w.r.t. commodity $c$, i.e.,

$$\Delta Q_l^{(c)}(t) \triangleq Q_{s(l)}^{(c)}(t) - Q_{d(l)}^{(c)}(t) - V. \quad (70)$$

Except for the parameter $V$, the min-resource DBP routing algorithm is the same as the traditional DBP routing algorithm in Algorithm 3. It is shown in [57] that the average total link rate under the min-resource DBP routing algorithm is within $O(1/V)$ of the optimal value of the optimization problem in (69). A larger $V$ corresponds to a smaller delay and slower convergence speed to the stationary regime, while a smaller $V$ leads to a larger delay and faster convergence speed. It is confirmed by simulation results in [57] that the min-resource DBP routing algorithm with a proper $V$ has better delay performance than the traditional DBP routing algorithm.

The joint traffic-splitting and shortest-path-aided DBP routing algorithm proposed in [58] incorporates the shortest path concept into the DBP routing by minimizing the average number of hops between sources and destinations. Let $c \in \mathcal{C}$ denote a flow (source-destination pair) in the multi-hop network, which is specified by its source and destination, where $\mathcal{C}$ denotes the set of all the flows. Let $A_{c,h}$ denote the fraction of flow $c$ transmitted over paths with $h$ hops. Therefore, the average path-length minimization problem is formulated as below$^{11}$:

$$\min_{\{A_{c,h}\}} \sum_{c \in \mathcal{C}} \sum_{0 \le h \le N-1} V h A_{c,h}, \quad (71)$$

where $V$ is a positive constant and the optimal solution is the same for all $V > 0$. Note that $N - 1$ is a universal upper bound on the number of hops along loop-free paths. To realize the shortest-path-aided routing, each node $n$ maintains a separate queue $(n, d(c), h)$ for the packets required to be delivered to node $d(c)$ within $h \in \mathbb{N}$ hops, and we denote its queue length at slot $t$ as $Q_{n,h}^{(c)}(t)$, where $\mathbb{N}$ denotes the set of natural numbers. Accordingly, define the backpressure of link $l$ w.r.t. flow $c$ and hop $h$ as follows:

where $H^{\min}_{d(l) \to c}$ is the minimum number of hops, i.e., the length of the shortest path, required from node $d(l)$ to the destination of flow $c$. Based on $\Delta Q_{l,h}^{(c)}(t)$, define the backpressure of link $l$ w.r.t. flow $c$ as

$$\Delta Q_l^{(c)}(t) \triangleq \max_{H^{\min}_{d(l) \to c} \le h \le N-1} \Delta Q_{l,h}^{(c)}(t). \quad (73)$$

The joint traffic-splitting and shortest-path-aided DBP routing algorithm proposed in f5W\ consists 
of two parts. For traffic splitting, at time t, the exogenous arrivals of flow c are deposited into 
queue {s{c),d{c),h*{t)), where h*{t) = argmino</i<Ar-i h + Q^f^''^^^l{t)^ . The shortest-path- 
aided DBP routing is the same as the traditional DBP routing with AQl'^\t) defined in fTS] ). 



It is shown in [58] that the joint traffic-splitting and shortest-path-aided DBP routing algorithm is throughput-optimal and solves the average path length minimization problem in (71) when $V \to \infty$. This enhanced DBP routing algorithm achieves significant delay improvement over the traditional DBP algorithm.

$^{11}$We omit the constraints of the optimization problem in (71) due to page limit. Please refer to [58] for details.

C. Delay Reduction in DBP Routing by Modified Queueing Discipline

The traditional DBP routing maintains large queue lengths at nodes (especially those far from the destination nodes) so as to form gradients for data flows. This guarantees throughput optimality while leading to poor delay performance. In the following, we introduce the algorithms proposed in [59]-[61], which try to maintain the gradients for data flows in DBP routing while reducing delay for most of the packets.

In [59], the proposed fast quadratic Lyapunov based algorithm (FQLA) can achieve an $[O(1/V), O(\log^2(V))]$ utility-delay tradeoff, which is greatly improved compared with the $[O(1/V), O(V)]$ utility-delay tradeoff of the traditional DBP routing algorithm (also called the quadratic Lyapunov based algorithm (QLA) in [59]). In [59], the authors show that under QLA, the backlog vector "typically" stays close to an "attractor" and the probability of the backlog vector deviating from the attractor is exponentially decreasing in distance. Based on this "exponential attraction" result, FQLA subtracts the attractor to form a virtual backlog process and applies the traditional DBP routing based on the virtual backlog process with slight modification by allowing packet dropping



under certain conditions. It is shown in [59] that the FQLA is throughput-optimal and the packet drop fraction is $O(1/V^{c_0 \log(V)})$. With the sacrifice of packet dropping, the FQLA improves the utility-delay tradeoff.

LIFO DBP routing was first proposed in the empirical work [60] by simply replacing the FIFO service discipline in the traditional DBP routing with LIFO. The authors in [60] show by simulations that LIFO DBP routing drastically improves the average delay. Using the "exponential attraction" result developed in [59], Neely shows in [61] that the LIFO DBP routing algorithm can achieve an $[O(1/V), O(\log^2(V))]$ utility-delay tradeoff for almost all the arriving packets, except an $O(1/V^{c_0 \log(V)})$ fraction of them. The reason is as follows. FIFO and LIFO DBP routing result in the same queue process. By the "exponential attraction" result in [59], the queue size under DBP routing will mostly fluctuate within the interval $[Q_{\text{Low}}, Q_{\text{High}}]$, the length of which is shown to be $O(\log^2(V))$. The queue process deviates from this region with probability exponentially decreasing in distance. Using LIFO, most packets (except an $O(1/V^{c_0 \log(V)})$ fraction of the arrivals) enter and leave the queue when the queue length is in $[Q_{\text{Low}}, Q_{\text{High}}]$, i.e., they "see" a queue with average queue length about $Q_{\text{High}} - Q_{\text{Low}} = O(\log^2(V))$. Therefore, the average delay of these packets is greatly reduced, with the penalty that an $O(1/V^{c_0 \log(V)})$ fraction of the arrivals, sitting at the front of the queue, suffer from large delay and have to be dropped.

D. Delay Reduction in DBP Routing by Receiver Diversity

Under unreliable channel conditions, the traditional DBP routing, which makes routing decisions before each transmission, fails to exploit multi-receiver diversity in wireless networks. In the following, we discuss the routing algorithms which use the receiver diversity under unreliable channel conditions by routing packets to the successful receivers after each transmission [62]-[64].

The ExOR proposed in [65] is a shortest path routing algorithm which uses the expected transmission count metric (ETX) as the metric of link cost and chooses the receiver with the minimum ETX after each transmission. Thus, it can achieve better delay performance than the shortest path routing algorithm using ETX with the routing decision made before transmission [66]. However, ExOR is not throughput-optimal. In [62], [63], the authors propose the opportunistic routing with congestion diversity (ORCD) algorithm for multi-hop wireless networks with multiple sources and a single destination. ORCD is a shortest path routing algorithm with the queue length based congestion measure as the path length metric, and routes the packets along the paths with the minimum congestion after transmission. It is shown that ORCD is throughput-optimal.

The diversity backpressure routing (DIVBAR) algorithm proposed in [64] is a DBP routing algorithm exploiting receiver diversity in multi-hop wireless networks with multiple sources and multiple destinations. In DIVBAR, the backpressure of each node $n$ w.r.t. commodity $c$, i.e., $\Delta Q_n^{(c)}(t)$, is defined as the success probability weighted sum of $\Delta Q_l^{(c)}(t)$ defined in (66) over all links $l$ with $s(l) = n$. Then, the optimal commodity of node $n$ is defined as $c_n^*(t) = \arg\max_{c \in \mathcal{C}} \Delta Q_n^{(c)}(t)$. The resource allocation of DIVBAR based on $c_n^*(t)$ is similar to that of the traditional DBP algorithm, while packets are routed to the receiver with the largest positive $\Delta Q_l^{(c_n^*(t))}(t)$ among all the successful receivers after each transmission. Like traditional DBP routing, DIVBAR is throughput-optimal.
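The post-transmission forwarding decision described above can be sketched as follows. This is a simplified illustration for a single commodity: the function name `divbar_forward` and its arguments are hypothetical, and the sender is assumed to keep the packet when no successful receiver has a positive differential backlog.

```python
def divbar_forward(q_sender, q_receivers, successes):
    """Pick the next hop after a broadcast transmission.

    q_sender    -- backlog of the chosen commodity at the transmitting node
    q_receivers -- dict: neighbor -> backlog of the same commodity at that neighbor
    successes   -- dict: neighbor -> True if that neighbor received the packet
    Returns the successful receiver with the largest positive differential backlog,
    or None if the packet stays at the sender.
    """
    best, best_diff = None, 0
    for node, received in successes.items():
        diff = q_sender - q_receivers[node]
        if received and diff > best_diff:
            best, best_diff = node, diff
    return best

# Neighbor b has the largest differential backlog but missed the packet, so c is chosen.
next_hop = divbar_forward(5, {"b": 1, "c": 3, "d": 7},
                          {"b": False, "c": True, "d": True})
```

Deferring the choice until the reception outcome is known is what lets the algorithm benefit from multi-receiver diversity over unreliable channels.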

VII. Comparisons 

In this section, we compare the three approaches to delay-sensitive resource allocation, using the uplink OFDMA system example as an illustration.

A. Comparison of Solution Structures and Complexity

In general, the solution obtained by the first approach (equivalent rate constraint) is adaptive to the CSI only. On the other hand, the solutions obtained by the second approach (Lyapunov stability drift) and the third approach (MDP) are adaptive to both the CSI and the QSI, but the MDP approach has higher complexity. Using the uplink OFDMA system as an example, the solution structures of the second and third approaches are quite similar. For the Lyapunov drift approach, the solution is obtained by the one-hop dynamic backpressure algorithm (M-LWDF) in Example 2 with the following optimization problem formulation:



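$$\max_{\mathbf{s},\mathbf{p}} \sum_{l=1}^{L} Q_l \left( \sum_{m=1}^{N_F} s_{l,m} \log\big(1 + p_{l,m}|H_{l,m}|^2\big) \right), \quad \forall (\mathbf{H}, \mathbf{Q}) \in \mathcal{H} \times \mathcal{Q} \quad (74)$$

$$\text{s.t.} \quad s_{l,m} \in \{0, 1\}, \ \forall l \in \mathcal{L}, m \in \{1, 2, \ldots, N_F\}, \qquad \sum_{l \in \mathcal{L}} s_{l,m} \le 1, \ \forall m, \qquad \sum_{m=1}^{N_F} p_{l,m} \le P_l, \ \forall l \in \mathcal{L}.$$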


For the MDP approach, the solution is obtained by solving the Bellman equation (42) in Example 4 with the following equivalent optimization problem formulation:



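$$\max_{\mathbf{s},\mathbf{p}} \sum_{l=1}^{L} \frac{\Delta V(Q_l)}{\overline{N}_l} \left( \sum_{m=1}^{N_F} s_{l,m} \log\big(1 + p_{l,m}|H_{l,m}|^2\big) \right), \quad \forall (\mathbf{H}, \mathbf{Q}) \in \mathcal{H} \times \mathcal{Q} \quad (75)$$

$$\text{s.t.} \quad s_{l,m} \in \{0, 1\}, \ \forall l \in \mathcal{L}, m \in \{1, 2, \ldots, N_F\}, \qquad \sum_{l \in \mathcal{L}} s_{l,m} \le 1, \ \forall m, \qquad \sum_{m=1}^{N_F} p_{l,m} \le P_l, \ \forall l \in \mathcal{L}.$$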

Observe that the M-LWDF problem in ( |74| ) is very similar to the MDP problem in f75] ) except 
that the weighj^for the l-th link (i.e., throughput of the /-th MS) in the later case f75| ) is given by 



the potential function AV{Qi) whereas the weight in the former case ( [74] ) is given by the queue 
state Qi. The subcarrier allocation allocation in all the three approaches will select the subcarrier 
with the highest metric. The metric in the equivalent rate constraint approach is a function of the 
CSI only whereas the metrics in the other two approaches are functions of the CSI and the QSI. 

'^Note that the factor =- represents the transformation from packet flow (considered in Example to bit flow (considered 
in ) Example j2j 
The power control in all three approaches has a similar form of power water-filling w.r.t. 
the CSI. However, the water-level of each link in the equivalent rate constraint approach only 
adapts to the Lagrange multiplier corresponding to the average delay constraint in (17) of each 
queue. In other words, the water-levels of different links are different in general (with different 
average delay requirements), while the water-level of the same link remains constant across 
different realizations of the real-time system state $(\mathbf{H}(t), \mathbf{Q}(t))$. In the other two approaches, 
however, the water-level of each link varies with the real-time system state realization $(\mathbf{H}(t), \mathbf{Q}(t))$. 
Specifically, in the Lyapunov drift approach, the water-level of the $l$-th link is determined by 
the QSI $Q_l(t)$. In the MDP approach, the water-level of the $l$-th link is determined by the QSI 
via the potential function $\Delta V(Q_l(t))$, which is obtained by solving the Bellman equation 
(42) of Example 4. As a result, the major difference between the second approach and the third 
approach lies in the calculation of the weight. In the third approach, additional processing is 
involved to compute the potential functions, and this contributes to additional complexity. 
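The dependence of the water-level on the weight can be illustrated with a minimal sketch. It assumes the textbook per-subcarrier problem of maximizing `weight * log(1 + p * gain) - mu * p` over `p >= 0`, where `mu` is a hypothetical Lagrange multiplier for the power constraint; the function names, the natural logarithm and the numerical values are illustrative assumptions, not the exact expressions derived in the examples of this paper.

```python
def water_level(weight, mu):
    """Water-level of one link: the weight divided by the power multiplier."""
    return weight / mu


def water_fill(weight, gains, mu):
    """Per-subcarrier power that maximizes weight*log(1 + p*g) - mu*p for p >= 0.

    Setting the derivative to zero gives p = weight/mu - 1/g, clipped at zero.
    gains holds the channel power gains |H|^2 of one link (all assumed > 0).
    """
    level = water_level(weight, mu)
    return [max(level - 1.0 / g, 0.0) for g in gains]


gains = [1.0, 0.5, 0.125]

# CSI-only behaviour: the weight is fixed, so the water-level never moves.
csi_only = water_fill(4.0, gains, 1.0)

# QSI-adaptive behaviour: the weight follows the queue state (Q_l in the
# Lyapunov drift approach, a potential-function difference in the MDP approach).
short_queue = water_fill(2.0, gains, 1.0)
long_queue = water_fill(6.0, gains, 1.0)
```

In this sketch a longer queue raises the water-level, so more subcarriers of the link become active and each active subcarrier receives more power.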



B. Comparison on Distributed Implementation 

In this part, we discuss the feasibility of distributed implementations using the different 
approaches. First of all, using the first approach, the optimization problem is transformed 
into a standard convex optimization problem (such as in Example 1). As such, traditional 
primal-dual decomposition techniques may be used to explore distributed implementations. For example, 
in Example 1, the subcarrier allocation can be done by a distributed auction mechanism, and 
the power allocation at each user can be calculated locally according to the auction results 
and the local CSI. Readers may refer to [71] for a survey on the decomposition method. 
On the other hand, using the second approach, the one-hop dynamic backpressure algorithm 
(M-LWDF) only requires the local system state information at the transmitter, and therefore 
the M-LWDF problem can be solved distributively. For example, in Example 2, the subcarrier 
allocation can be done by a distributed auction mechanism where the auction bid is determined 
by the local CSI and the local QSI, and the power allocation at each user can be calculated 
locally according to the auction results and the local system information. However, in multi-hop 
networks, the QSI of the neighboring nodes is required at each node, raising additional signaling 
overhead for the distributed implementation. Finally, using the third approach, the obstacle to 
the distributed implementation comes from the potential function $V(x)$ and the transition kernel 






term $\Pr[x' \mid x, \Omega(x)]$. In general, these terms are not decomposable, and this poses an additional 
challenge (compared with the second approach) in obtaining a distributed solution using the MDP 
approach. In Section V-D, we have illustrated that by approximating the potential functions or 
Q-factors as the sum of per-link potential functions or Q-factors, a distributed solution can be 
obtained via an auction mechanism. 
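The information flow of such an auction can be sketched as follows. This is a minimal illustration, not the mechanism specified in the examples of this paper: the class `User`, the function `run_auction`, the bid formula (a weighted rate minus a power price with a hypothetical multiplier `mu`) and the tie-breaking rule are all assumptions. The point is only that each bid is computed from information local to one transmitter, while the auctioneer sees nothing but the bids.

```python
import math


class User:
    """A transmitter that knows only its own weight, channel gains and multiplier."""

    def __init__(self, weight, gains, mu):
        self.weight = weight  # e.g. the local queue length or a per-link potential
        self.gains = gains    # local CSI: power gain on each subcarrier (> 0)
        self.mu = mu          # hypothetical local power price

    def bid(self, m):
        """Return (bid, power) for subcarrier m using local information only."""
        g = self.gains[m]
        p = max(self.weight / self.mu - 1.0 / g, 0.0)
        return self.weight * math.log(1.0 + p * g) - self.mu * p, p


def run_auction(users, n_sub):
    """Award each subcarrier to the highest bidder; ties go to the lowest index."""
    winners = []
    for m in range(n_sub):
        bids = [u.bid(m)[0] for u in users]
        winners.append(max(range(len(users)), key=bids.__getitem__))
    return winners


users = [User(2.0, [1.0, 0.1], 1.0), User(2.0, [0.5, 4.0], 1.0)]
winners = run_auction(users, n_sub=2)
```

With equal weights the subcarriers simply go to the user with the better channel; changing one user's weight shifts the outcome without the auctioneer ever seeing that user's CSI or QSI.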



C. Comparison of Performance 

In this section, we compare the performance of the three approaches using the uplink OFDMA 
system example. For simplicity, we assume $L = 3$ users and a buffer length of $N_Q = 5$ (packets). 
The scheduling slot duration is $\tau = 1$ ms. All the users have the same average Poisson packet 
arrival rate $\lambda = 3$ (packets/s) and an exponential packet size distribution with mean packet size 
$\bar{N} = 5000$ (bits/packet). The total bandwidth is assumed to be 10 MHz, with 1024 subcarriers 
and 5 independent subbands. 
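As a quick sanity check on these settings, the quantities they imply can be computed directly. The variable names below are illustrative; only the numerical values (3 packets/s, 5000 bits per packet, 1 ms slots, 10 MHz, 1024 subcarriers, 5 subbands) come from the setup above.

```python
slot_duration_s = 1e-3        # scheduling slot duration
arrival_rate_pps = 3.0        # average Poisson arrival rate (packets/s)
mean_packet_bits = 5000.0     # mean packet size (bits/packet)
bandwidth_hz = 10e6           # total bandwidth
n_subcarriers = 1024
n_subbands = 5

# Offered load per user in bits per second and in bits per scheduling slot.
offered_load_bps = arrival_rate_pps * mean_packet_bits
offered_bits_per_slot = offered_load_bps * slot_duration_s

# Expected number of packet arrivals per user in one slot.
arrivals_per_slot = arrival_rate_pps * slot_duration_s

# Frequency granularity of the resource grid.
subcarrier_spacing_hz = bandwidth_hz / n_subcarriers
subband_bandwidth_hz = bandwidth_hz / n_subbands
```

Each user therefore offers 15 kbit/s on average, and a packet arrives in only about 0.3% of the slots, so the queues are empty for most slots under these settings.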

Fig. 6 compares the average delay performance of the three approaches under the same average 
power constraints. It can be observed that the delay performance of the MDP approach is better 
than those of the equivalent rate constraint approach and the Lyapunov drift approach in the entire 
operating regime. Furthermore, Fig. 6 also illustrates that the performance of the approximated 
MDP approach is very close to that of the brute-force MDP solution. As a result, the approximated MDP 
approach is an acceptable way to reduce the complexity while achieving near-optimal performance. 
On the other hand, the equivalent rate constraint approach (CSI-only policy) is the simplest 
solution, but its gap in delay performance is small only in the very large delay regime. The 
delay performance (and the complexity) of the Lyapunov stability drift approach lies between 
those of the CSI-only approach and the MDP approach. 

Fig. 7 compares the delay performance of the three approaches with different numbers of users. 
The average transmit SNR for each user is 17.75 dB. Similar observations about the performance 
and the complexity of the three approaches can be made. 

Fig. 8 illustrates the convergence property of the approximate MDP approach using distributed 
stochastic learning. We plot the average per-link potential functions of the 3 users versus the 
scheduling slot index at a transmit SNR of 10 dB. It can be seen that the distributed algorithm 
converges quite fast. The average delay corresponding to the average per-link potential functions 
at the 500-th scheduling slot is 5.9, which is much smaller than those of the other baselines. 
Fig. 6. Comparison of the delay performance of the equivalent rate constraint, Lyapunov stability drift and MDP approaches 
under the same average power constraints (horizontal axis: average power in dB). The packet arrival rate is $\lambda = 3$ (packets/s) 
with average packet length $\bar{N} = 5000$ (bits). The average packet drop rates of all schemes are 1%. 

Fig. 7. The average delay per user versus the number of users. The average transmit power for each user is 17.75 dB, and the 
average Poisson packet arrival rate is $\lambda = 1.5$ (packets/s) with mean packet size $\bar{N} = 5000$ (bits). The packet drop rates for all 
the schemes are 1%. 

Fig. 8. Illustration of the convergence property for the distributed online learning algorithm of the approximated MDP approach. 
Parameter vector versus the iteration index with the average transmit SNR 14.3 dB. The average Poisson packet arrival rate is 
$\lambda = 3$ (packets/s) with mean packet size $\bar{N} = 5000$ (bits). 



VIII. Summary 

In this paper, we have introduced three major approaches, namely the equivalent rate constraint 
approach, the Lyapunov stability drift approach and the MDP approach, to deal with delay-aware 
resource allocation for wireless networks. For the MDP approach, we use the approximated 
MDP and stochastic learning to address the curse of dimensionality and facilitate distributed online 
implementation. Moreover, we also elaborate on how to use these approaches in an uplink 
OFDMA system. It is shown by simulations that the equivalent rate constraint approach performs 
better than the Lyapunov stability drift approach in the large delay regime and worse in the small 
delay regime, and that the MDP approach has much better delay performance than the other two 
schemes in all regimes. 
Appendix A: Proof of Lemma 3 

Since each representative state is updated comparably often in the asynchronous learning 
algorithm, quoting the conclusion in [69], the convergence properties of the asynchronous update 
and the synchronous update are the same. Therefore, we consider the convergence of the related 
synchronous version for simplicity in this proof. 



Let $c \in \mathbb{R}$ be a constant; we have $(T c\mathbf{M}\mathbf{V}_t)(I) = c\,(T\mathbf{M}\mathbf{V}_t)(I)$. Similar to [70], it is easy to 
see that the parameter vector $\{\mathbf{V}_t\}$ is bounded almost surely during the iterations of the algorithm. 
In the following, we first introduce and prove a lemma on the convergence of the learning 
noise. 

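Lemma 5: Define 

$$
\mathbf{q}_t = \mathbf{M}^{-1}\left[\mathbf{g}(\Omega_t) + \mathbf{F}(\Omega_t)\mathbf{M}\mathbf{V}_t - \mathbf{M}\mathbf{V}_t - (T\mathbf{M}\mathbf{V}_t)(I)\,\mathbf{e}\right].
$$

When the number of iterations $t \ge j \to \infty$, the update procedure can be written as follows with 
probability 1: 

$$
\mathbf{V}_{t+1} = \mathbf{V}_j + \sum_{i=j}^{t} \epsilon_i \mathbf{q}_i .
$$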

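Proof: The update of the parameter vector can be written in the following vector form: 

$$
\mathbf{V}_{t+1} = \mathbf{V}_t + \epsilon_t \mathbf{M}^{-1}\left[\mathbf{g}(\Omega_t) + \mathbf{J}_t\mathbf{M}\mathbf{V}_t - \mathbf{M}\mathbf{V}_t - \left(g(I,\Omega_t) + (\mathbf{M}\mathbf{V}_t)(I^+)\right)\mathbf{e}\right],
$$

where the matrix $\mathbf{J}_t$ (with exactly one element equal to 1 in each row) denotes the real-time observed 
state transition from the $t$-th frame to the $(t+1)$-th frame, and $I^+$ denotes the observed next state 
if the current state is $I$. Define 

$$
\mathbf{Y}_t = \mathbf{M}^{-1}\left[\mathbf{g}(\Omega_t) + \mathbf{J}_t\mathbf{M}\mathbf{V}_t - \mathbf{M}\mathbf{V}_t - \left(g(I,\Omega_t) + (\mathbf{M}\mathbf{V}_t)(I^+)\right)\mathbf{e}\right],
$$

$\delta\mathbf{Z}_t = \mathbf{q}_t - \mathbf{Y}_t$ and $\mathbf{Z}_t = \sum_{i=j}^{t} \epsilon_i \delta\mathbf{Z}_i$. The online potential estimation can be rewritten as 

$$
\begin{aligned}
\mathbf{V}_{t+1} &= \mathbf{V}_t + \epsilon_t \mathbf{Y}_t \\
&= \mathbf{V}_t + \epsilon_t \mathbf{q}_t - \epsilon_t \delta\mathbf{Z}_t \\
&= \mathbf{V}_j + \sum_{i=j}^{t} \epsilon_i \mathbf{q}_i - \mathbf{Z}_t .
\end{aligned}
\tag{76}
$$

Footnote: Please refer to [69] for the definition of "comparably often". 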
Our proof of Lemma 5 can be divided into the following steps: 

1. Letting $\mathcal{F}_t = \sigma(\mathbf{V}_m, m \le t)$, it is easy to see that $\mathbb{E}[\delta\mathbf{Z}_t \mid \mathcal{F}_{t-1}] = 0$. Thus, $\{\delta\mathbf{Z}_t \mid \forall t\}$ is a 
martingale difference sequence and $\{\mathbf{Z}_t \mid \forall t\}$ is a martingale sequence. Moreover, $\mathbf{Y}_t$ is an 
unbiased estimate of $\mathbf{q}_t$ and the estimation noise is uncorrelated. 

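2. According to the uncorrelated estimation error from Step 1, we have 

$$
\mathbb{E}\left[|\mathbf{Z}_t|^2 \,\middle|\, \mathcal{F}_{j-1}\right]
= \mathbb{E}\left[\Big|\sum_{i=j}^{t} \epsilon_i \delta\mathbf{Z}_i\Big|^2 \,\middle|\, \mathcal{F}_{j-1}\right]
= \sum_{i=j}^{t} \mathbb{E}\left[|\epsilon_i \delta\mathbf{Z}_i|^2 \,\middle|\, \mathcal{F}_{j-1}\right]
\le \bar{\mathbf{Z}} \sum_{i=j}^{t} (\epsilon_i)^2 \to 0 \quad \text{when } j \to \infty,
$$

where $\bar{\mathbf{Z}} \ge \max_{j \le i \le t} \mathbb{E}\left[|\delta\mathbf{Z}_i|^2 \mid \mathcal{F}_{j-1}\right]$ is a bounded constant vector and the convergence of 
$\bar{\mathbf{Z}} \sum_{i=j}^{t} (\epsilon_i)^2$ follows from the definition of the sequence $\{\epsilon_i\}$. 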

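3. From Step 1, $\{\mathbf{Z}_t \mid \forall t\}$ is a martingale sequence. Hence, according to the martingale 
inequality, we have 

$$
\Pr\left[\sup_{j \le i \le t} |\mathbf{Z}_i| \ge \lambda \,\middle|\, \mathcal{F}_{j-1}\right] \le \frac{\mathbb{E}\left[|\mathbf{Z}_t|^2 \mid \mathcal{F}_{j-1}\right]}{\lambda^2}, \quad \forall \lambda > 0.
$$

From the conclusion of Step 2, we have 

$$
\lim_{j \to \infty} \Pr\left[\sup_{j \le i \le t} |\mathbf{Z}_i| \ge \lambda \,\middle|\, \mathcal{F}_{j-1}\right] = 0, \quad \forall \lambda > 0.
$$

Hence, from (76) we almost surely have $\mathbf{V}_{t+1} = \mathbf{V}_j + \sum_{i=j}^{t} \epsilon_i \mathbf{q}_i$ when $j \to \infty$. 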
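The role of the step sizes in Step 2 can be checked numerically. The sketch below assumes a hypothetical harmonic step size $\epsilon_i = 1/i$, a common choice for stochastic approximation; the sequence actually used by the learning algorithm is defined elsewhere in the paper, and the function name `tail_energy` is illustrative.

```python
def tail_energy(j, t, step=lambda i: 1.0 / i):
    """Sum of squared step sizes from index j to index t (inclusive).

    This is the factor multiplying the bounded noise variance in the bound
    on the accumulated learning noise, so it must vanish as j grows.
    """
    return sum(step(i) ** 2 for i in range(j, t + 1))


horizon = 100_000
tails = [tail_energy(j, horizon) for j in (10, 100, 1000)]
```

For the harmonic choice the tail is below $1/(j-1)$, so the accumulated noise contribution shrinks roughly in inverse proportion to the starting index $j$.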



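Moreover, the following lemma is about the limit of the sequence $\{\mathbf{q}_t\}$. 
Lemma 6: Suppose the following two inequalities are true for $l = a, a+1, \ldots, a+b$: 

$$
\mathbf{g}(\Omega_{l-1}) + \mathbf{F}(\Omega_{l-1})\mathbf{M}\mathbf{V}_{l-1} \le \mathbf{g}(\Omega_l) + \mathbf{F}(\Omega_l)\mathbf{M}\mathbf{V}_{l-1}, \tag{77}
$$

$$
\mathbf{g}(\Omega_l) + \mathbf{F}(\Omega_l)\mathbf{M}\mathbf{V}_l \le \mathbf{g}(\Omega_{l-1}) + \mathbf{F}(\Omega_{l-1})\mathbf{M}\mathbf{V}_l, \tag{78}
$$

then we have 

$$
|q^i_{a+b}| \le C_1 \prod_{k=0}^{\lfloor b/\beta \rfloor - 1} (1 - \delta_{a+k\beta}), \quad \forall i, \tag{79}
$$

where $q^i_{a+b}$ denotes the $i$-th element of the vector $\mathbf{q}_{a+b}$ and $C_1$ is some constant. 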



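Proof: Let $w_l = (T\mathbf{M}\mathbf{V}_l)(I)$. According to Lemma 5, we have 

$$
\mathbf{V}_l = \mathbf{V}_{l-1} + \epsilon_{l-1}\mathbf{q}_{l-1};
$$

therefore, from (77) and (78), 

$$
\begin{aligned}
\mathbf{q}_l &\le \left[(1-\epsilon_{l-1})\mathbf{I} + \mathbf{M}^{-1}\mathbf{F}(\Omega_{l-1})\mathbf{M}\epsilon_{l-1}\right]\mathbf{q}_{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e} = \mathbf{B}_{l-1}\mathbf{q}_{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e}, \\
\mathbf{q}_l &\ge \left[(1-\epsilon_{l-1})\mathbf{I} + \mathbf{M}^{-1}\mathbf{F}(\Omega_l)\mathbf{M}\epsilon_{l-1}\right]\mathbf{q}_{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e} = \mathbf{A}_{l-1}\mathbf{q}_{l-1} + w_{l-1}\mathbf{e} - w_l\mathbf{e}.
\end{aligned}
$$

Notice that 

$$
\begin{aligned}
\mathbf{A}_{l-1}\mathbf{e} &= (1-\epsilon_{l-1})\mathbf{I}\mathbf{e} + \mathbf{M}^{-1}\mathbf{F}(\Omega_l)\mathbf{M}\epsilon_{l-1}\mathbf{e} = (1-\epsilon_{l-1})\mathbf{e} + L\epsilon_{l-1}\mathbf{e}, \\
\mathbf{B}_{l-1}\mathbf{e} &= (1-\epsilon_{l-1})\mathbf{I}\mathbf{e} + \mathbf{M}^{-1}\mathbf{F}(\Omega_{l-1})\mathbf{M}\epsilon_{l-1}\mathbf{e} = (1-\epsilon_{l-1})\mathbf{e} + L\epsilon_{l-1}\mathbf{e},
\end{aligned}
$$

where $L$ is the total number of links in the network. Since $\mathbf{A}_{l-1}\mathbf{e} = \mathbf{B}_{l-1}\mathbf{e}$, we have 

$$
\begin{aligned}
& \mathbf{A}_{l-1}\cdots\mathbf{A}_{l-\beta}\,\mathbf{q}_{l-\beta} - C_1\mathbf{e} \le \mathbf{q}_l \le \mathbf{B}_{l-1}\cdots\mathbf{B}_{l-\beta}\,\mathbf{q}_{l-\beta} - C_1\mathbf{e} \\
\Rightarrow\;& (1-\delta_l)\left[\min \mathbf{q}_{l-\beta}\right]\mathbf{e} \le \mathbf{q}_l + C_1\mathbf{e} \le (1-\delta_l)\left[\max \mathbf{q}_{l-\beta}\right]\mathbf{e} \\
\Rightarrow\;& \max \mathbf{q}_l + C_1 \le (1-\delta_l)\max \mathbf{q}_{l-\beta}, \qquad \min \mathbf{q}_l + C_1 \ge (1-\delta_l)\min \mathbf{q}_{l-\beta} \\
\Rightarrow\;& \max \mathbf{q}_l - \min \mathbf{q}_l \le (1-\delta_l)\left[\max \mathbf{q}_{l-\beta} - \min \mathbf{q}_{l-\beta}\right] \\
\Rightarrow\;& |q^i_l| \le \max \mathbf{q}_l - \min \mathbf{q}_l \le C_2(1-\delta_l), \quad \forall i,
\end{aligned}
$$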

where the first step is due to the conditions on the matrix sequences $\{\mathbf{A}_l\}$ and $\{\mathbf{B}_l\}$, $\max \mathbf{q}_l$ and $\min \mathbf{q}_l$ 
denote the maximum and minimum elements in $\mathbf{q}_l$ respectively, $C_1$ and $C_2$ are constants, and the 
first inequality of the last step holds because $\min \mathbf{q}_l \le 0$. Hence, the conclusion is straightforward. 



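Therefore, the proof of Lemma 3 can be divided into the following steps: 

1. From the property of the sequence $\{\epsilon_t\}$, we have $\prod_{i=0}^{\lfloor t/\beta \rfloor - 1} (1 - \delta_{i\beta}) \to 0$ ($t \to \infty$). 

2. According to the first step, noting that $\delta_t = O(\epsilon_t)$, from (79) we have $\mathbf{q}_t \to 0$ ($t \to \infty$). 

3. Therefore, the update on $\{\mathbf{V}_t\}$ will converge to $\mathbf{V}_\infty$, which satisfies the following fixed-point 
equation: 

$$
\theta\mathbf{e} + \mathbf{V}_\infty = \mathbf{M}^{-1}T(\mathbf{M}\mathbf{V}_\infty).
$$

This completes the proof. 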
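The limit used in the first step can be made explicit with the elementary bound $1 - x \le e^{-x}$. The following is a sketch under two assumptions that are not stated in this appendix: that $0 \le \delta_{i\beta} \le 1$ for all $i$, and that the subsampled sequence is not summable, i.e. $\sum_i \delta_{i\beta} = \infty$ (as holds, for instance, when $\delta_t$ is proportional to a harmonic step size such as $1/(t+1)$).

$$
0 \le \prod_{i=0}^{n} (1 - \delta_{i\beta}) \le \prod_{i=0}^{n} e^{-\delta_{i\beta}} = \exp\left(-\sum_{i=0}^{n} \delta_{i\beta}\right) \to 0 \quad \text{as } n = \lfloor t/\beta \rfloor - 1 \to \infty .
$$

Under these assumptions the product vanishes exactly because the step sizes decay slowly enough for their sum to diverge, which is the usual requirement on stochastic approximation step sizes.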